Which Aligners Recognize Soft-Masked Repeats In Reference Sequences?
3
9
11.2 years ago

Which aligners (long and short read) behave differently when parts of a reference/target sequence are "soft-masked", i.e. have portions in lowercase to designate repeat regions?

alignment • 9.9k views
14
11.2 years ago
lh3 32k

No. Do not align to a masked genome for any purpose. Instead, filter out the reads mapped to masked regions after whole-genome alignment.
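A minimal sketch of that post-alignment filtering, assuming you still have a soft-masked copy of the reference to read the repeat coordinates from. The function names and the toy `(read_name, start, end)` alignment tuples are hypothetical and only for illustration; in practice the coordinates would come from your alignment file.

```python
import re

def lowercase_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lowercase runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

def overlaps_masked(start, end, intervals):
    """True if the half-open interval [start, end) overlaps any masked interval."""
    return any(start < m_end and m_start < end for m_start, m_end in intervals)

# Toy soft-masked reference: lowercase runs mark repeats.
reference = "ACGTacgtacgtACGTACGTnnnnACGT"
masked = lowercase_intervals(reference)

# Hypothetical alignments against the UNMASKED genome: (read_name, start, end).
alignments = [("read1", 0, 4), ("read2", 2, 8), ("read3", 12, 20), ("read4", 18, 22)]

# Keep only reads that do not touch a masked interval.
kept = [name for name, s, e in alignments if not overlaps_masked(s, e, masked)]
print(masked)
print(kept)
```

The point is that the aligner sees the full sequence, and the mask is applied only as a filter on the results.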

2

Masking has never been perfect and probably never will be. Aligning to a masked genome leads to wrongly mapped reads, spurious SNP/indel calls and all sorts of other problems. I cannot think of a single use case where masking leads to better outcomes. Trust me. Do not mask.

1

Yup. Do not mask. You get the most accurate alignment when you align to what is actually there. What you do not want are reads that really belong to repetitive regions being forced to align to the wrong place because you didn't provide the correct sequence for the read to align to.

bwa does not care about lowercase nucleotides.

0

What difference would it make, apart from paired-end or spliced mappings?

0

So I assume BWA does not care about lowercase nucleotides?

0

BWA always uses all bases in alignment. Again, do not mask, unless you want to invite trouble.

5
11.2 years ago

In LASTZ, soft-masked regions are NOT available for seeding, but alignments are allowed to extend through them. It also lets you specify a separate file of intervals to mask (with softmask=<mask_file>).
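To illustrate the seeding-versus-extension distinction, here is a toy sketch. It is not LASTZ's actual algorithm or code; the function names, the exact-match seeds and the ungapped extension are simplifying assumptions made only for this example.

```python
def seed_positions(seq, k):
    """Start positions of k-mers made up entirely of uppercase (unmasked) bases."""
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k].isupper()]

def extend_right(target, query, t_pos, q_pos):
    """Ungapped extension to the right, comparing bases case-insensitively,
    so the extension can run through soft-masked (lowercase) target bases."""
    n = 0
    while (t_pos + n < len(target) and q_pos + n < len(query)
           and target[t_pos + n].upper() == query[q_pos + n].upper()):
        n += 1
    return n

target = "ACGTacgtTTGA"   # positions 4..7 are soft-masked
query = "ACGTACGTTT"

seeds = seed_positions(target, 4)            # no seed may include a masked base
length = extend_right(target, query, 0, 0)   # extension ignores the mask
print(seeds, length)
```

Seeds can only start where the whole k-mer is unmasked, yet a hit seeded at position 0 still extends across the lowercase stretch.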

1
8.8 years ago

FSA also takes soft-masked regions into account when run with the --softmasked option.