This question was originally posted on Bioinformatics Stack Exchange. At this time, I've received no answers there, so I thought I'd cross-post here.
I am using the pairwiseAlignment
function from the Biostrings
package to calculate the distance between a consensus sequence and a set of other sequences. Some example code:
# R v4.2.0
library(Biostrings) # v2.63.3
library(BiocGenerics) # v0.41.2
library(tidyverse) # v1.3.1
# consensus sequence
con <- DNAStringSet(c("GTTGTGATTTGCTTTCRAATTAGTATCTTTGAACCATTGAAAACAAC"))
# subject sequences
seq <- DNAStringSet(c("ATTGTGATTTGCTTTCAAATTAGTATCTTTAAACCATTGAAAACAGC",
"GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
"GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
"GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
"GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC"))
# substitution matrix
sub <- nucleotideSubstitutionMatrix(match = 0, mismatch = 1)
# get alignments
sapply(X = 1:length(seq),
FUN = function(i) {
pairwiseAlignment(pattern = con,
subject = seq[i],
substitutionMatrix = sub) %>%
score
}
) %>% as_tibble %>% mutate(sequence = as.character(seq))
The resulting tibble looks like this:
# A tibble: 5 × 2
value sequence
<dbl> <chr>
1 -3.5 ATTGTGATTTGCTTTCAAATTAGTATCTTTAAACCATTGAAAACAGC
2 -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
3 -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
4 -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
5 -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
Something isn't quite right. Take a look at the alignment of con
against seq[2]
.
con: GTTGTGATTTGCTTTCRAATTAGTATCTTTGAACCATTGAAAACAAC
|||||||||||||||| ||||||||||||||||||||||||||||||
seq[2]: GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
Note that the only difference is at the position in the consensus sequence where there is ambiguous IUPAC encoding. Because R can represent A or G, the score for seq[2]
should be 0 (a perfect match). If the ambiguous base was instead in seq[2]
and con
was unambiguous, then the score penalty would be appropriate. The discrepancy here is due to the symmetry of the substitution matrix sub
:
A C G T M R W S Y K
A 0.0000000 -1.0000000 -1.0000000 -1.0000000 -0.5000000 -0.5000000 -0.5000000 -1.0000000 -1.0000000 -1.0000000
C -1.0000000 0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.5000000 -1.0000000
G -1.0000000 -1.0000000 0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000
T -1.0000000 -1.0000000 -1.0000000 0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000 -0.5000000
M -0.5000000 -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -1.0000000
R -0.5000000 -1.0000000 -0.5000000 -1.0000000 -0.7500000 -0.5000000 -0.7500000 -0.7500000 -1.0000000 -0.7500000
W -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.7500000 -0.7500000 -0.5000000 -1.0000000 -0.7500000 -0.7500000
S -1.0000000 -0.5000000 -0.5000000 -1.0000000 -0.7500000 -0.7500000 -1.0000000 -0.5000000 -0.7500000 -0.7500000
Y -1.0000000 -0.5000000 -1.0000000 -0.5000000 -0.7500000 -1.0000000 -0.7500000 -0.7500000 -0.5000000 -0.7500000
K -1.0000000 -1.0000000 -0.5000000 -0.5000000 -1.0000000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.5000000
V -0.6666667 -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.8333333
H -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.6666667 -0.8333333
D -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.6666667
B -1.0000000 -0.6666667 -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.8333333 -0.6666667 -0.6666667 -0.6666667
N -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000
V H D B N
A -0.6666667 -0.6666667 -0.6666667 -1.0000000 -0.75
C -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.75
G -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.75
T -1.0000000 -0.6666667 -0.6666667 -0.6666667 -0.75
M -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.75
R -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.75
W -0.8333333 -0.6666667 -0.6666667 -0.8333333 -0.75
S -0.6666667 -0.8333333 -0.8333333 -0.6666667 -0.75
Y -0.8333333 -0.6666667 -0.8333333 -0.6666667 -0.75
K -0.8333333 -0.8333333 -0.6666667 -0.6666667 -0.75
V -0.6666667 -0.7777778 -0.7777778 -0.7777778 -0.75
H -0.7777778 -0.6666667 -0.7777778 -0.7777778 -0.75
D -0.7777778 -0.7777778 -0.6666667 -0.7777778 -0.75
B -0.7777778 -0.7777778 -0.7777778 -0.6666667 -0.75
N -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.75
As part of a discussion on Github, I provided the code to manually reassign elements of sub
to create an asymmetric substitution matrix:
sub["N", "A"] <- 0
sub["N", "T"] <- 0
sub["N", "G"] <- 0
sub["N", "C"] <- 0
sub["M", "A"] <- 0
sub["M", "C"] <- 0
sub["R", "A"] <- 0
sub["R", "G"] <- 0
sub["W", "A"] <- 0
sub["W", "T"] <- 0
sub["S", "C"] <- 0
sub["S", "G"] <- 0
sub["Y", "C"] <- 0
sub["Y", "T"] <- 0
sub["K", "G"] <- 0
sub["K", "T"] <- 0
sub["V", "A"] <- 0
sub["V", "C"] <- 0
sub["V", "G"] <- 0
sub["H", "A"] <- 0
sub["H", "C"] <- 0
sub["H", "T"] <- 0
sub["D", "A"] <- 0
sub["D", "G"] <- 0
sub["D", "T"] <- 0
sub["B", "C"] <- 0
sub["B", "G"] <- 0
sub["B", "T"] <- 0
This covers all cases where the pattern (con
) contains ambiguous bases and the subjects (seq
) are unambiguous, changing the matrix so the subjects are not penalized. However, these edits do not account for cases where both the pattern and the subject have ambiguous encodings at the same position; e.g. ["M","V"]
and ["V","M"]
have the same value associated with them, even though V (A || C || G
) is a superset of M (A || C
). There are also more nuanced cases where the set of bases represented by a pair of ambiguous encodings overlap incompletely in both directions, as with H (A || C || T
) and D (A || G || T
). Given enough time, I could probably figure out the appropriate scores for each case to complete the asymmetric substitution matrix, but I'm surely reinventing the wheel. Can someone supply, as a standalone table or as part of a codebase, an asymmetric nucleotide substitution matrix that satisfies the criteria outlined herein?