Question

Calculating genetic distance when some sites are ambiguous (heterozygotes)

24 months ago

emacdoug • 0

I'm constructing a matrix of pairwise genetic distances based on Sanger sequence data. The samples are diploid with many variable sites, so the DNA sequences contain a lot of legitimate ambiguity codes (R, M, W, etc.). I'd like to calculate distances that make use of this information, so that, for example, AAA and ARA have a pairwise distance greater than zero but less than the pairwise distance between AAA and AGA. That is, it makes sense to me that a heterozygote should sit at an intermediate genetic distance between the two homozygotes.

I've tried dist.alignment in seqinR and dist.dna in ape, but both seem to drop the ambiguous characters as missing data. Ideas on how I can fix this, or other commands/packages to try, would be very welcome!

heterozygous distance ape R genetic • 751 views

updated 13 months ago by Ben Anderson • 0 • written 24 months ago by emacdoug • 0

score 0 · Answer 1 · 2023-04-28

How about using the MATCHSTATES distance (or GENPOFAD) in the package pofadinr by Joly et al. (https://github.com/simjoly/pofadinr)? Both compute a distance that takes the ambiguity codes into account instead of discarding them.

library(ape)
library(pofadinr)

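# read the alignment (all sequences must have the same length)
alignment <- read.FASTA("input.fasta", type = "DNA")

# convert unknown bases ("n") to "?"
# as.matrix() makes as.character() return a character matrix,
# so the replacement below works element by element
temp <- as.character(as.matrix(alignment))
temp[temp == "n"] <- "?"
alignment <- as.DNAbin(temp)

# pairwise distances that use the ambiguity codes
distances <- dist.snp(alignment, model = "MATCHSTATES")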
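
To make the "intermediate distance" idea from the question concrete, here is a minimal base-R sketch of a toy allele-sharing distance. It is not the pofadinr implementation and is not guaranteed to match MATCHSTATES or GENPOFAD values. The names iupac, site_dist and seq_dist are hypothetical helpers made up for this illustration, and the per-site rule (one minus shared bases over the size of the larger base set) is an assumption chosen for simplicity. Only the four bases and the two-base IUPAC codes are handled.

```r
# Toy illustration only: NOT the pofadinr implementation.
# Map each supported IUPAC code to the set of bases it represents.
iupac <- list(
  a = "a", c = "c", g = "g", t = "t",
  r = c("a", "g"), y = c("c", "t"), s = c("c", "g"),
  w = c("a", "t"), k = c("g", "t"), m = c("a", "c")
)

# Per-site distance (assumed rule for this sketch):
# 1 - (number of shared bases / size of the larger base set).
site_dist <- function(x, y) {
  bx <- iupac[[x]]
  by <- iupac[[y]]
  1 - length(intersect(bx, by)) / max(length(bx), length(by))
}

# Mean per-site distance between two aligned sequences of equal length.
seq_dist <- function(s1, s2) {
  a <- strsplit(tolower(s1), "")[[1]]
  b <- strsplit(tolower(s2), "")[[1]]
  stopifnot(length(a) == length(b))
  mean(mapply(site_dist, a, b))
}

d_hom_het <- seq_dist("AAA", "ARA")  # homozygote vs heterozygote
d_hom_hom <- seq_dist("AAA", "AGA")  # homozygote vs other homozygote
print(c(hom_het = d_hom_het, hom_hom = d_hom_hom))
```

Under this toy rule the AAA/ARA distance comes out positive and smaller than the AAA/AGA distance, which is the ordering asked for. For real data, the pofadinr distances above are the better choice.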