Uneven coverage correlated with Alu sequences (and discordant read pairs) in NGS data
2
5
5.7 years ago

OK. I realize this is not exactly a bioinformatics question but I know that a lot of people in this forum spend their days staring at NGS alignments and am hoping someone has an explanation or some insight.

See the IGV screenshot below of representative matched tumor and normal samples. The pattern shown is VERY characteristic of the problem and consistent throughout the entire genome. The data are from whole genome sequencing (WGS) on the Illumina X10 platform, aligned with BWA-MEM to GRCh37. The symptoms are (1) very uneven coverage, with valleys and peaks that seem to correlate perfectly with the absence or presence of Alu elements, respectively; and (2) unusually high discordant read pair rates (5-25%). Note that the region to the right, with few Alu sequences, looks reasonably good, although it falls short of the targeted coverage given how many reads are burned up in the problematic spikes around Alu elements. Also note that the IGV session is colored by insert size and pair orientation. The different colors represent reads whose mate is aligned to a different chromosome or at an unexpected fragment size. The alignments of both mates are generally good, as if they spanned a real translocation, but there is no consistency in breakpoints and there are too many diverse discrepancies for even the most rearranged genome.

We have seen this in two projects now, both with externally provided DNA. Both projects were 50-100 samples, and the problem was quite consistent (with minor exceptions) across the sample set. It does not correlate with instrument, lane, or flowcell, as other samples (from other projects) that were pooled with these samples had no problems. Repeated sequencing on a different instrument/platform and even new library constructions (with different kits: Kapa, SWIFT, etc.) produced data with identical characteristics.

The only thing that has helped is entirely new DNA preps from source materials. In the first project we went back to the original tumor/normal tissues and prepared new DNA preps in our own lab, and the problem went away entirely. That is a solution (although an expensive one, since it basically repeats all the work and sequencing), but I would like to understand the root cause.
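To put a number on the first symptom, one option is to compare mean depth inside annotated Alu intervals with mean depth outside them for a given region. The following is only a minimal sketch: the function name `alu_depth_ratio` is made up for illustration, and it assumes per-base depth and Alu coordinates have already been extracted (for example from a depth table and a repeat annotation) into plain Python lists.

```python
def alu_depth_ratio(depth, alu_intervals):
    """Mean depth inside Alu intervals divided by mean depth outside them.

    depth: per-base depths for one region (index 0 is the region start).
    alu_intervals: (start, end) half-open offsets into `depth`.
    Both inputs are assumed to be prepared elsewhere; this is illustrative only.
    """
    in_alu = [False] * len(depth)
    for start, end in alu_intervals:
        # Clip each interval to the region before marking its bases
        for i in range(max(start, 0), min(end, len(depth))):
            in_alu[i] = True

    inside = [d for d, flag in zip(depth, in_alu) if flag]
    outside = [d for d, flag in zip(depth, in_alu) if not flag]
    if not inside or not outside:
        raise ValueError("need bases both inside and outside Alu intervals")

    mean_outside = sum(outside) / len(outside)
    if mean_outside == 0:
        raise ValueError("no coverage outside Alu intervals")
    return (sum(inside) / len(inside)) / mean_outside
```

A ratio near 1 would be expected for an unaffected sample; a ratio well above 1 in the affected samples, and not in samples pooled on the same lanes, would match the pattern described here.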

Has anyone seen this? Any insight as to what might be going wrong in sample prep to cause this apparent enrichment in regions of the genome coinciding with Alu elements? Google has failed me so far. Let me know if there is somewhere else that I might try posting this.

NGS WGS sequencing alignment • 2.2k views
0
5.7 years ago

Hi!

I don't find this pattern very surprising given that:

1 - Alu elements are repeated sequences: a read originating from one Alu repeat can very well map to another repeat. What is the mapping quality for those reads? This can explain the high discordant pairing rate (one read mapping to one repeat, its mate to another repeat).

2 - Alu elements are subject to copy number variation (ref). There could be more Alu elements in the genome of your sample than in your reference genome, leading to higher coverage.

This, of course, doesn't explain why going back to sample preparation changes the patterns...
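The mapping quality question in point 1 can be checked directly from the alignments. Below is a minimal sketch, assuming the reads are available as SAM text lines (for example, output streamed from `samtools view`). The function name `pair_stats` and the MAPQ cutoff of 30 are illustrative assumptions, and "discordant" here simply means the aligner did not set the proper-pair flag.

```python
from collections import Counter

def pair_stats(sam_lines, min_mapq=30):
    """Count discordant reads and how many of them have high MAPQ.

    sam_lines: iterable of SAM-format text lines (headers are skipped).
    min_mapq: illustrative cutoff for a "good" mapping quality.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq, rnext = int(fields[1]), fields[2], int(fields[4]), fields[6]

        # Keep only paired, primary records where read and mate are both mapped:
        # 0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x100 secondary, 0x800 supplementary
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        counts["paired_mapped"] += 1

        if flag & 0x2:
            continue  # properly paired according to the aligner
        counts["discordant"] += 1
        if rnext not in ("=", "*", rname):
            counts["interchromosomal"] += 1
        if mapq >= min_mapq:
            counts["discordant_high_mapq"] += 1
    return counts
```

If most discordant reads have low MAPQ, ambiguous placement among repeats is a plausible explanation; if `discordant_high_mapq` is close to `discordant`, it is less likely to be a mapping artifact.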

Hope this helps,

Carlo

2

1. I don't think this has to do with a general Alu repeat low complexity, low mapping quality issue. The reads in these peaks and both reads of the discordant pairs almost entirely have normal (very good) mapping qualities. In any case, if it just had to do with mapping issues between Alu elements I would expect these patterns to be more common across projects, not specific to these two projects. To give some context I have looked at literally thousands of other samples from dozens of projects without this issue.

2. Copy number variation of Alu elements is an interesting idea. This might explain a rare sample having this pattern, but not 80-90% of samples in one project from ~50 different individuals. Also, there is a huge diversity in Alu elements. While they certainly share some basic similarities, it is not conceivable to me that focal amplifications of some Alu elements would lead to such a genome-wide pattern of increased coverage at all Alu elements with good mapping qualities.

Finally, as you say, I don't see how a new DNA prep would solve the problem. I think the only explanation is some molecular biology gone wrong at the point of sample prep. Maybe some kind of contamination? An amplification protocol that we weren't informed of? I don't have enough knowledge/experience with that to guess what it would be. I'm hoping someone sees this post and recognizes the problem though. Thanks for thinking it through with me!

0
5.7 years ago

It's intriguing that making a fresh DNA prep solves the issue. If I'm not mistaken, Alu repeats are more AT-rich than average, which makes me think that something in the buffer of those samples influenced the amplification efficiency of GC-rich sequences, relatively in favour of AT-rich sequences including Alus.

The easiest way to investigate this would probably be to create (~100bp) windows, calculate GC content and plot this against average coverage per window, compare between "good" and "bad" samples and see if the correlation is skewed.
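A minimal sketch of that window-based check follows. It assumes the reference sequence for a region and its per-base depth are already loaded as a string and a list of equal length; the function names `window_gc_and_depth` and `pearson` are illustrative, not from any existing tool.

```python
from math import sqrt

def window_gc_and_depth(sequence, depth, window=100):
    """Return (gc_fraction, mean_depth) for each non-overlapping window.

    sequence: reference bases for the region; depth: per-base coverage.
    Windows with no A/C/G/T bases (e.g. all N) are skipped.
    """
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    rows = []
    for start in range(0, len(sequence) - window + 1, window):
        seq = sequence[start:start + window].upper()
        called = sum(seq.count(base) for base in "ACGT")
        if called == 0:
            continue
        gc = (seq.count("G") + seq.count("C")) / called
        mean_depth = sum(depth[start:start + window]) / window
        rows.append((gc, mean_depth))
    return rows

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

Running this on the same regions for a "good" and a "bad" sample and comparing the two correlation values (or plotting the `(gc, mean_depth)` pairs) would show whether the GC/coverage relationship is skewed in the affected samples.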

1

Or more simply, use computeGCBias from deepTools :)

0

Oh right, the wheel already exists ;-)

1

You are indeed correct that GC/AT content correlates with Alu elements. In general, Alu elements seem to be correlated with more GC-rich regions, although it depends on the type/age of the Alu element (and is probably more complicated than that). Indeed, the very top track in the IGV snapshot shows GC percentage, which I believe is calculated in much the same way as you propose. You can see that there is a general positive correlation between GC and Alu elements.

I like your theory that inconsistent amplification efficiency is at play here. But amplification typically occurs at library construction or in cluster formation, and repeating these steps did not help. Typically we would not do amplification at the DNA isolation step unless there were issues with DNA quantity that necessitated it. So I guess one possible explanation is that some kind of whole genome amplification was done after DNA isolation and we were not informed of it. Then, when we requested the samples to perform our own isolation, without amplification, the problem went away. It's strange that we would now have 2 or 3 projects with this issue where we were not informed of something so fundamental to the sample preparation, though such communication failures are certainly not unheard of.

Can anyone out there verify that these alignment patterns are consistent with some kind of amplification protocol at/after DNA isolation but before library construction/sequencing?