Hello,
We have been trying to develop a target capture sequencing of around one hundred regions in the human genome. The probes capture a region of approximately 500bp around some genetic variants of interest.
Our bioinformatics pipeline uses bwa-mem
for aligning the captured reads to a custom genome reference. These "custom" genome consists of ~2000 bp regions around the genetic variants of interest that we are capturing. Thus, the custom genome reference has around 200000 bps. We then proceed to do variant calling, using GATK4. With the current pipeline, we have more than 10% of the captured regions, producing bad quality genotyping calls.
I have been revising the pipeline, as my intuition was telling me that aligning the whole capture reads to a 200000 bp genome would be problematic and prone to artefacts. By quickly inspecting the aligned bam files, I could observe the following:
On the top track, I have a whole genome sequencing sample, aligned to the reduced genome with 101 regions of 2000 bps. On the bottom, the same sample processed for target-capture sequencing, aligned to the same "genome", showing with clear peaks around the capture regions (1000 bp and - 3900 bp). As you can appreciate, both samples have further peaks that seem to be reads with lots of mismatches (one immediately to the right of the capture peak at 1000 bp, another one at 4100 bp). My immediate thought, was that these corresponded to some repetitive regions and that regions derived from similar class of repetitive elements were being misaligned there.
To test this hypothesis, I proceed to perform alignment with bwa-mem
against the whole human genome. This gives the following results:
This gives seemingly a clean and "normal" plot for the distribution of reads across the regions that we are analysing. For the whole genome sequencing, you see a consistent number of reads per position. For the target-capture sequencing you see two very clean peaks corresponding to the target regions. Moreover, you can see that the peaks from the "reduced" alignment correspond to two Alu elements, one of the most abundant class of repeats in the human genome.
This was the confirmation I needed to show that the current pipeline is subpar and not fit to use for our needs. What I am struggling with, is finding a solution for this issue.
The potential solutions I thought of are:
- Perform alignment to the whole human genome. This might be an issue, as we have close to 1000 samples, and this would be very taxing in terms of resources and time);
- Perform
bwa-mem
alignment with stricter parameters. Unfortunately, given the nature of the algorithm (seed and extend), it will always end up forcing alignment to regions that it thinks are the "best" targets. Even if the reads are clearly being misaligned. - Perform filtering downstream of the alignment with
samtools view
. I tried filtering for MAPQ > 30 and percentage of mismatches < 3%. Wanted to try to get rid of clipped reads as well. Sadly, so far, even if I can clean reads a bit, there's still a big amount of misaligned reads that remain; - Create a new "reduced" genome, with increased sizes around target-capture regions (e.g. 10 kb, 50 kb, 100 kb), to dampen the number of misaligned regions around the capture regions;
- Add the X chromosome to the panel of 101 regions of 2000 bps, to dampen the number of misaligned regions around the capture regions
I thought of the two last strategies, in order to find a solution that would not compromise our available resources and time of analysis, while reducing the misalignment of reads close to the target capture regions.
Sorry for the long post. But I really would appreciate some ideas from the community! Thank you in advance for your help!
Best wishes, Miguel
yes. Exome Sequencing: Masking The Non-Genic Sequences ?
if you want a "quick' result, pipe the output of bwa in samtools -L
Thank you for your answer. If I got it right, your solution is to perform whole genome alignment, and then extract the reads mapping to the target regions, right? My main issue here, is how long it takes to perform the whole genome alignment of a single sample (WGS with 30-50X coverage).
Cheers
it depends of the number of threads you're using. But you could have the output withing a day.
Hello again! Following the GATK4 pipeline, I have to produce a uBAM, an aligned BAM, merge both, mark duplicated and sort (all using the GATK/Picard tools... which are really not super efficient at using multiple threads!). This takes almost a full day to perform... for a single WGS sample.
What I am trying to come up with, is with an alternative way to align the samples. What I've tried so far is:
What I am thinking is that I need to find sequences to add to the reference genome, that I could use as a "bait" to clean the misaligned reads. Not sure this makes sense?
why ?
you can use samtools sort, sambamba, etc...
bwa mem2 , dragmap ...
Mostly, because we are following the best practices workflow by GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
I asked to change the tools used prior to the variant calling step... but my colleagues are very strict on following the GATK pipeline. :|
That's why I'm trying to find alternatives to aligning to the whole genome (which, admittedly, is the best strategy!).