OK. I realize this is not exactly a bioinformatics question but I know that a lot of people in this forum spend their days staring at NGS alignments and am hoping someone has an explanation or some insight.
See the IGV screenshot below of representative matched tumor and normal samples. The pattern shown is very characteristic of the problem and is consistent throughout the entire genome. The data are from whole genome sequencing (WGS) on the Illumina X10 platform, aligned with BWA-MEM to GRCh37.
The symptoms are (1) very uneven coverage, with valleys and peaks that seem to correlate perfectly with the absence or presence of Alu elements, respectively; and (2) unusually high discordant read pair rates (5-25%). Note that the region to the right, which has few Alu sequences, looks reasonably good, although it is below the targeted coverage because so many reads are burned up in the problematic spikes around Alu elements. Also note that the IGV session is colored by insert size and pair orientation: the different colors mark reads whose mate is aligned to a different chromosome or at an unexpected fragment size. The alignments of both mates are generally good, as if they spanned a real translocation, but there is no consistency in the breakpoints, and the discrepancies are too many and too diverse for even the most rearranged genome.
We have seen this in two projects now, both with externally provided DNA. Each project had 50-100 samples, and the problem was quite consistent (with minor exceptions) across the sample set. It does not correlate with instrument, lane, or flowcell: other samples (from other projects) that were pooled with these samples had no problems. Repeated sequencing on a different instrument/platform, and even new library constructions (with different kits: Kapa, SWIFT, etc.), produced data with identical characteristics.
The only thing that has helped is entirely new DNA preps from source materials. In the first project we went back to the original tumor/normal tissues and prepared new DNA in our own lab, and the problem went away entirely. That is a solution (although an expensive one, since it basically means repeating all the work and sequencing), but I would like to understand the root cause.
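For anyone who wants to put a number on symptom (2) in their own data, here is a minimal Python sketch. It is not the pipeline we used. It assumes uncompressed SAM text (for example, the output of `samtools view`), and the function name `discordant_pair_stats` is made up for illustration. It counts primary alignments where both mates are mapped, then reports how many are not flagged as a proper pair (FLAG bit 0x2, whose exact meaning depends on the aligner) and how many have the mate on a different chromosome.

```python
def discordant_pair_stats(sam_lines):
    """Summarize pair discordance from an iterable of SAM text lines.

    Hypothetical helper for illustration only. Considers primary alignments
    of paired reads where both the read and its mate are mapped.
    """
    paired = 0          # reads that pass the filters below
    not_proper = 0      # paired reads without the proper-pair bit (0x2)
    interchrom = 0      # paired reads whose mate maps to another reference

    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rnext = fields[6]

        # Keep only paired reads (0x1); drop read unmapped (0x4),
        # mate unmapped (0x8), secondary (0x100), supplementary (0x800).
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue

        paired += 1
        if not flag & 0x2:
            not_proper += 1
        # RNEXT is "=" when the mate is on the same reference as the read.
        if rnext not in ("=", "*") and rnext != fields[2]:
            interchrom += 1

    return {
        "paired": paired,
        "not_proper": not_proper,
        "interchromosomal": interchrom,
        "discordant_rate": not_proper / paired if paired else 0.0,
    }
```

Feeding it the lines of a SAM stream gives a per-sample rate that can be compared between affected and unaffected projects.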
Has anyone seen this? Any insight as to what might be going wrong in sample prep to cause this apparent enrichment in regions of the genome coinciding with Alu elements? Google has failed me so far. Let me know if there is somewhere else that I might try posting this.
Thanks for the comments!
I don't think this is a general problem of low complexity or low mapping quality in Alu repeats. The reads in these peaks, and both reads of the discordant pairs, almost entirely have normal (very good) mapping qualities. In any case, if it were just a matter of ambiguous mapping between Alu elements, I would expect these patterns to be common across projects, not specific to these two. For context, I have looked at literally thousands of other samples from dozens of projects without seeing this issue.
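To make the mapping-quality point checkable, here is a rough Python sketch (again, an illustration and not what I actually ran). It takes SAM text lines and compares the fraction of high-MAPQ reads between proper and non-proper pairs. The function name `high_mapq_fraction` and the cutoff of 30 are arbitrary assumptions.

```python
def high_mapq_fraction(sam_lines, min_mapq=30):
    """Fraction of reads with MAPQ >= min_mapq, split by proper-pair status.

    Hypothetical helper; min_mapq=30 is an arbitrary illustrative cutoff.
    Only primary alignments with both mates mapped are counted.
    """
    # For each class: [reads at or above the cutoff, total reads]
    counts = {"proper": [0, 0], "discordant": [0, 0]}

    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])

        # Paired (0x1), read and mate mapped, not secondary/supplementary.
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue

        key = "proper" if flag & 0x2 else "discordant"
        counts[key][1] += 1
        if mapq >= min_mapq:
            counts[key][0] += 1

    # None means no reads of that class were seen.
    return {k: (high / n if n else None) for k, (high, n) in counts.items()}
```

If the discordant pairs were a simple multi-mapping artifact, I would expect their high-MAPQ fraction to be clearly lower than that of the proper pairs, which is not what the data look like.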
Copy number variation of Alu elements is an interesting idea. It might explain a rare sample having this pattern, but not 80-90% of the samples in one project from ~50 different individuals. Also, there is a huge diversity among Alu elements. While they certainly share some basic similarities, it is not conceivable to me that focal amplifications of some Alu elements would lead to a genome-wide pattern of increased coverage at all Alu elements with good mapping qualities.
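One way to quantify "increased coverage at all Alu elements" is to compare mean depth inside and outside annotated Alu intervals. The Python sketch below is only an illustration under stated assumptions: depths are given as (position, depth) pairs for a single chromosome, the Alu intervals are 0-based, half-open, and non-overlapping (merged beforehand if necessary), and the name `alu_coverage_ratio` is made up.

```python
from bisect import bisect_right


def alu_coverage_ratio(depths, alu_intervals):
    """Ratio of mean depth inside Alu intervals to mean depth outside them.

    Hypothetical helper for one chromosome.
    depths: iterable of (pos, depth) with 0-based positions.
    alu_intervals: list of (start, end), 0-based half-open, non-overlapping.
    Returns None if either region has no positions or outside depth is 0.
    """
    intervals = sorted(alu_intervals)
    starts = [start for start, _ in intervals]

    in_sum = in_n = out_sum = out_n = 0
    for pos, depth in depths:
        # Find the last interval starting at or before pos, if any.
        i = bisect_right(starts, pos) - 1
        inside = i >= 0 and pos < intervals[i][1]
        if inside:
            in_sum += depth
            in_n += 1
        else:
            out_sum += depth
            out_n += 1

    if in_n == 0 or out_n == 0 or out_sum == 0:
        return None
    return (in_sum / in_n) / (out_sum / out_n)
```

A ratio near 1 would be expected for a well-behaved sample. Computing it per chromosome, or per Alu subfamily, would show whether the enrichment really is uniform across all Alu elements rather than driven by a few amplified copies.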
Finally, as you say, I don't see how a new DNA prep would solve the problem. I think the only explanation is some molecular biology gone wrong at the point of sample prep. Maybe some kind of contamination? An amplification protocol that we weren't informed of? I don't have enough knowledge/experience with that to guess what it would be. I'm hoping someone sees this post and recognizes the problem though. Thanks for thinking it through with me!