Can BWA (mem) detect longer deletions?
7 months ago
weixiaokuan ▴ 140

Hi,

I am trying to use BWA to align my NGS data to a reference genome. My NGS data may come from very long deletions in specific regions of the genome. I wonder whether BWA or its associated tools can detect such long deletions. Are there any parameters that I have to tune? Or is there a limit on the deletion size BWA can detect?

Thank you!

-Xiaokuan

alignment bwa • 1.6k views

Hi,

Thank you for replying to my question. I may need to clarify this a bit. In general, when we align reads to a reference using BWA, it outputs a CIGAR string in the alignment result (SAM file). This string indicates deletions within the aligned reads. I am wondering: if my reads are long enough, for example 2 x 300 bp paired, will BWA be able to detect or report deletions of about 100 bp or longer? I don't want to make variant calls or involve other tools; I just want to understand BWA's capabilities.

Thank you.

-Xiaokuan


A 100 bp deletion is not that long. You can pair BWA with an appropriate caller to get variants of that size. So as to your question about "capability": yes, BWA is able to do so. If you are asking whether BWA is biased against longer deletions, it certainly is, because of its gap extension penalties.
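To make the gap-penalty bias concrete, here is a back-of-the-envelope comparison using bwa mem's default scoring (match +1 via -A, gap open 6 via -O, gap extension 1 via -E, clipping penalty 5 via -L). This is an illustrative sketch of the scoring trade-off, not a statement about BWA's actual internals, which also depend on the band width (-w) and other heuristics:

```python
# Rough score comparison under bwa mem's default scoring parameters:
# match = +1 (-A), gap open = 6 (-O), gap extension = 1 (-E), clip penalty = 5 (-L).
# Illustrative only: bwa mem's real decision also involves band width (-w) etc.

def deletion_score(read_len, del_len, match=1, gap_open=6, gap_ext=1):
    """Score of aligning the whole read across a deletion of del_len bases."""
    return read_len * match - (gap_open + gap_ext * del_len)

def softclip_score(read_len, clipped, match=1, clip_pen=5):
    """Score of aligning only part of the read and soft-clipping the rest."""
    return (read_len - clipped) * match - clip_pen

read_len = 300
for del_len in (10, 100, 300, 500):
    d = deletion_score(read_len, del_len)
    # suppose the deletion sits mid-read, so roughly half the read gets clipped
    s = softclip_score(read_len, read_len // 2)
    print(del_len, d, s, "deletion wins" if d > s else "clipping wins")
```

Under these defaults the deletion representation stops winning once the gap extension cost eats the extra matched bases (here, somewhere below ~150 bp for a mid-read event in a 300 bp read), which is consistent with 100 bp deletions being reportable but much longer ones ending up soft-clipped.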

7 months ago
rfran010 ▴ 900

A variant caller will infer deletions by comparing alignments to the reference sequence, so you could still detect longer deletions that way.

However, to answer your other question directly, you probably won't see these pop up in the CIGAR string, since long deletions look more like intronic gaps. To my knowledge, BWA is not optimized for this type of alignment, which is one reason we don't usually see it used for RNA-seq alignments.

Instead, BWA is a local aligner: if a read spans a long deletion, it is likely that only part of the read will align and the rest will be soft-clipped. You may be able to tweak the parameters to capture a longer deletion within the CIGAR string, but there is probably an upper limit to this. I think 100 bp may be doable with 300 bp reads, but I've never looked into it closely.
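The two outcomes above are easy to tell apart in a SAM record. A small sketch that scans a CIGAR string for the longest run of a given operation (plain string parsing, no pysam required; the CIGAR strings are made-up examples):

```python
import re

# Parse a CIGAR string into (length, op) pairs, e.g. "100M50D150M".
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def longest_op(cigar, op):
    """Return the length of the longest run of a given CIGAR operation (0 if absent)."""
    return max((int(n) for n, o in CIGAR_RE.findall(cigar) if o == op), default=0)

# A 100 bp deletion reported within the alignment:
print(longest_op("100M100D200M", "D"))   # 100
# The same event reported as a soft-clip instead:
print(longest_op("150M150S", "D"))       # 0
print(longest_op("150M150S", "S"))       # 150
```

Checking for large S operations alongside D operations is a quick way to see which representation the aligner chose for your reads.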

At least in theory, you could consider STAR, which may map these reads more accurately and report the spanning of longer junctions in the CIGAR string.


Good comment; I moved it to an answer.


Thank you, @rfran010. Your point was what I wanted to discuss. Thank you for your insight! It's very helpful. -Xiaokuan

6 months ago

When you expect long deletions in NGS data, I recommend BBMap; it is specifically designed to handle large deletions in short reads. The defaults are usually fine, but in this case you may want to add the flag "maxindel=400k" to allow alignment across deletion events of up to 400,000 bp, which works fine with 300 bp reads. Its accompanying variant caller (CallVariants) is also designed to call indels directly from alignments rather than by inference.

bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam maxindel=400k
callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf

You can use an inference-based mapper/caller pair instead, but they typically won't get the long deletion bounds quite right.

7 months ago
cmdcolin ★ 3.8k

minimap2 may be uniquely qualified to do this, particularly for aligning long reads. Quoting the minimap2 paper: "Now minimap2 v2.22 can more accurately map long reads to highly repetitive regions and align through insertions or deletions up to 100kb by default".

Note: if you are using e.g. paired-end short reads, it's likely you would instead be looking for a longer-than-expected distance between pairs to find large deletions.
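That "longer-than-expected distance" idea can be sketched by checking the template length (TLEN, SAM column 9) of each pair against the expected insert-size range. The SAM lines and the 600 bp cutoff below are made-up illustration values, not recommendations:

```python
# Toy sketch: flag read pairs whose observed template length (TLEN, SAM col 9)
# exceeds the expected insert size, hinting at a deletion spanned by the pair.
# The records and the 600 bp cutoff are hypothetical illustration values.

sam_lines = [
    "r1\t99\tchr1\t1000\t60\t150M\t=\t1350\t500\t*\t*",     # normal insert size
    "r2\t99\tchr1\t5000\t60\t150M\t=\t15350\t10500\t*\t*",  # spans ~10 kb deletion
]

def discordant_pairs(lines, max_insert=600):
    """Return (name, |TLEN|) for records whose template length exceeds max_insert."""
    hits = []
    for line in lines:
        fields = line.split("\t")
        name, tlen = fields[0], abs(int(fields[8]))
        if tlen > max_insert:
            hits.append((name, tlen))
    return hits

print(discordant_pairs(sam_lines))  # [('r2', 10500)]
```

Structural-variant callers that use paired-end evidence apply essentially this signal, combined with split-read (soft-clip) support.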

minimap2 paper https://arxiv.org/abs/2108.03515
