Removing gaps from a multiple sequence alignment in order to analyse SNPs
7.9 years ago
natasha • 0

Hi

I have aligned ~50 bacterial whole genomes with the intention of analysing SNPs to construct their phylogeny.

My alignment contains both '-' and 'N' characters. I was intending to use trimAl to delete any gaps ('-') from my alignment before undertaking SNP analysis. Is this a good idea? Also, since I have 'N's present, will these affect my SNP analysis, and if so, what should I do?

Thanks!

trimAL SNP alignment • 2.3k views
7.9 years ago
DG 7.3k

Depending on how closely related your genomes are, removing every site that contains a gap may throw away a large number of truly informative sites. I typically build phylogenies from phylogenomic or multi-gene alignments myself, and we wouldn't have much data left if we removed all of the gap-containing sites. Can the programs that work from SNP input alone handle missing data? If they can (and they should), you'll get better overall resolution by leaving the gaps in.

Removing poorly aligned regions is a different story, of course: you don't want to introduce noise and false-positive signal into the analysis, and the sites at the edges of large "gappy" regions are often misaligned.

Similarly, for N-containing sites it depends on whether your downstream analysis program can handle ambiguous data. If it can, leaving them in doesn't really hurt. If the program doesn't handle Ns but does handle gap characters, you can typically replace the Ns with '-' in your alignment. Most phylogeny programs treat the two in the same way when reconstructing the tree.
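
To make the options above concrete, here is a minimal sketch in plain Python. It is not trimAl or any particular SNP caller: the function names, the `max_missing_frac` parameter and the toy alignment are made up for illustration. The alignment is assumed to be already loaded as a dict of equal-length sequences. The sketch converts Ns to gaps, drops columns with too much missing data instead of dropping every gapped column, and then keeps only the variable sites.

```python
# Characters treated as missing data (an assumption for this sketch).
MISSING = {"-", "N", "?"}


def mask_n_as_gap(aln):
    """Replace N/n with '-' so both are handled as missing data."""
    return {name: seq.replace("N", "-").replace("n", "-")
            for name, seq in aln.items()}


def _check_lengths(aln):
    """Return the alignment length, or raise if sequences differ in length."""
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same length")
    return lengths.pop()


def _select_columns(aln, keep):
    """Build a new alignment containing only the column indices in keep."""
    return {name: "".join(seq[i] for i in keep) for name, seq in aln.items()}


def filter_columns(aln, max_missing_frac=0.5):
    """Keep columns whose fraction of missing characters is at most
    max_missing_frac. A value of 0.0 removes every column that has
    any gap or N at all."""
    length = _check_lengths(aln)
    seqs = list(aln.values())
    keep = []
    for i in range(length):
        n_missing = sum(s[i].upper() in MISSING for s in seqs)
        if n_missing / len(seqs) <= max_missing_frac:
            keep.append(i)
    return _select_columns(aln, keep)


def variable_sites(aln):
    """Keep columns with at least two distinct non-missing bases."""
    length = _check_lengths(aln)
    seqs = list(aln.values())
    keep = []
    for i in range(length):
        bases = {s[i].upper() for s in seqs} - MISSING
        if len(bases) >= 2:
            keep.append(i)
    return _select_columns(aln, keep)


# Toy alignment, invented for this example.
aln = {"a": "ACGT-N", "b": "ACGTAN", "c": "ATG-AN"}

masked = mask_n_as_gap(aln)
tolerant = filter_columns(masked, max_missing_frac=0.5)
strict = filter_columns(masked, max_missing_frac=0.0)
snps = variable_sites(tolerant)
```

On this toy input the tolerant filter only drops the final all-missing column, while the strict filter also drops the two columns that have a single gap. With real genomes the difference can be large, which is the reason to check whether your downstream program accepts missing data before deleting every gapped site.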
