Expanding reference genome with alt loci and decoy sequences
1
1
Entering edit mode
3 months ago
appropiate ▴ 80

If I wanted to expand a reference genome fasta file with alternate loci and decoy fasta files in order to get better read alignments, would it be enough with merging the files (with, say, cat *.fa > expanded_genonme.fa) before starting the typical mapping workflow (samtools, picard, bwa ...), or there are additional steps to take care of?

read alt reference decoy genome alignment • 609 views
1
Entering edit mode
3 months ago

Actually, your alignments will be worse when you expand a reference genome with alternate loci, see e.g. this recent comment by ATpoint and this blog post by Heng Li. The latter also highlights some issues with commonly used reference genomes, but the lack of alternate loci is a feature, not a bug ;-)

0
Entering edit mode

Thanks! I have replied to that recent comment since it has more participants. Please feel free to comment there as well!

1
Entering edit mode

Neither what ATpoint nor I wrote, contradicts this tutorial in any way. Furthermore, I trust, it is a good one when written by the GATK team.

But in your original post, you stated that you are going to start the typical mapping workflow (samtools, picard, bwa ...), afterwards and alt-ware mapping is generally not considered to be the typical mapping workflow. Most aligners are by default not alternate contig aware (alt-aware) and will thus assign poor mapping quality scores to many reads due to erroneously considering them being multi-mappers.

Nothing wrong with alt contigs, as long as you are using the right tools and it's a deliberate choice.

0
Entering edit mode

Thanks! So regarding my original question, could I just merge the fasta files and follow that tutorial?

0
Entering edit mode

Yes, technically you can append any sequence to the reference followed by building a new index. If that makes sense or is appropriate is a different question.

0
Entering edit mode

My criteria for whether it makes sense would be comparing MAPQ values for alignments when using reference genome vs expanded genome. If there are fewer alignments with MAPQ = 0 in the latter, I would think it makes sense. I am way off here?