We have been trying to develop a target capture sequencing of around one hundred regions in the human genome. The probes capture a region of approximately 500bp around some genetic variants of interest.
Our bioinformatics pipeline uses
bwa-mem for aligning the captured reads to a custom genome reference. These "custom" genome consists of ~2000 bp regions around the genetic variants of interest that we are capturing. Thus, the custom genome reference has around 200000 bps. We then proceed to do variant calling, using GATK4. With the current pipeline, we have more than 10% of the captured regions, producing bad quality genotyping calls.
I have been revising the pipeline, as my intuition was telling me that aligning the whole capture reads to a 200000 bp genome would be problematic and prone to artefacts. By quickly inspecting the aligned bam files, I could observe the following:
On the top track, I have a whole genome sequencing sample, aligned to the reduced genome with 101 regions of 2000 bps. On the bottom, the same sample processed for target-capture sequencing, aligned to the same "genome", showing with clear peaks around the capture regions (1000 bp and - 3900 bp). As you can appreciate, both samples have further peaks that seem to be reads with lots of mismatches (one immediately to the right of the capture peak at 1000 bp, another one at 4100 bp). My immediate thought, was that these corresponded to some repetitive regions and that regions derived from similar class of repetitive elements were being misaligned there.
To test this hypothesis, I proceed to perform alignment with
bwa-mem against the whole human genome. This gives the following results:
This gives seemingly a clean and "normal" plot for the distribution of reads across the regions that we are analysing. For the whole genome sequencing, you see a consistent number of reads per position. For the target-capture sequencing you see two very clean peaks corresponding to the target regions. Moreover, you can see that the peaks from the "reduced" alignment correspond to two Alu elements, one of the most abundant class of repeats in the human genome.
This was the confirmation I needed to show that the current pipeline is subpar and not fit to use for our needs. What I am struggling with, is finding a solution for this issue.
The potential solutions I thought of are:
- Perform alignment to the whole human genome. This might be an issue, as we have close to 1000 samples, and this would be very taxing in terms of resources and time);
bwa-memalignment with stricter parameters. Unfortunately, given the nature of the algorithm (seed and extend), it will always end up forcing alignment to regions that it thinks are the "best" targets. Even if the reads are clearly being misaligned.
- Perform filtering downstream of the alignment with
samtools view. I tried filtering for MAPQ > 30 and percentage of mismatches < 3%. Wanted to try to get rid of clipped reads as well. Sadly, so far, even if I can clean reads a bit, there's still a big amount of misaligned reads that remain;
- Create a new "reduced" genome, with increased sizes around target-capture regions (e.g. 10 kb, 50 kb, 100 kb), to dampen the number of misaligned regions around the capture regions;
- Add the X chromosome to the panel of 101 regions of 2000 bps, to dampen the number of misaligned regions around the capture regions
I thought of the two last strategies, in order to find a solution that would not compromise our available resources and time of analysis, while reducing the misalignment of reads close to the target capture regions.
Sorry for the long post. But I really would appreciate some ideas from the community! Thank you in advance for your help!
Best wishes, Miguel