Where to find an asymmetric nucleotide substitution matrix with IUPAC encoding?
1
1
Entering edit mode
2.0 years ago
acvill ▴ 340

This question was originally posted on Bioinformatics Stack Exchange. At this time, I've received no answers there, so I thought I'd cross-post here.


I am using the pairwiseAlignment function from the Biostrings package to calculate the distance between a consensus sequence and a set of other sequences. Some example code:

# R v4.2.0
library(Biostrings)   # v2.63.3
library(BiocGenerics) # v0.41.2
library(tidyverse)    # v1.3.1

# consensus sequence
con <- DNAStringSet(c("GTTGTGATTTGCTTTCRAATTAGTATCTTTGAACCATTGAAAACAAC"))

# subject sequences
seq <- DNAStringSet(c("ATTGTGATTTGCTTTCAAATTAGTATCTTTAAACCATTGAAAACAGC",
                      "GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
                      "GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
                      "GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC",
                      "GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC"))

# substitution matrix
sub <- nucleotideSubstitutionMatrix(match = 0, mismatch = 1)

# get alignments
sapply(X = 1:length(seq),
       FUN = function(i) {
         pairwiseAlignment(pattern = con,
                           subject = seq[i],
                           substitutionMatrix = sub) %>%
         score
      }
    ) %>% as_tibble %>% mutate(sequence = as.character(seq))

The resulting tibble looks like this:

# A tibble: 5 × 2
  value sequence                                       
  <dbl> <chr>                                          
1  -3.5 ATTGTGATTTGCTTTCAAATTAGTATCTTTAAACCATTGAAAACAGC
2  -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
3  -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
4  -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC
5  -0.5 GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC

Something isn't quite right. Take a look at the alignment of con against seq[2].

con:    GTTGTGATTTGCTTTCRAATTAGTATCTTTGAACCATTGAAAACAAC
        |||||||||||||||| ||||||||||||||||||||||||||||||
seq[2]: GTTGTGATTTGCTTTCAAATTAGTATCTTTGAACCATTGAAAACAAC

Note that the only difference is at the position in the consensus sequence where there is ambiguous IUPAC encoding. Because R can represent A or G, the score for seq[2] should be 0 (a perfect match). If the ambiguous base was instead in seq[2] and con was unambiguous, then the score penalty would be appropriate. The discrepancy here is due to the symmetry of the substitution matrix sub:

           A          C          G          T          M          R          W          S          Y          K
A  0.0000000 -1.0000000 -1.0000000 -1.0000000 -0.5000000 -0.5000000 -0.5000000 -1.0000000 -1.0000000 -1.0000000
C -1.0000000  0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.5000000 -1.0000000
G -1.0000000 -1.0000000  0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000
T -1.0000000 -1.0000000 -1.0000000  0.0000000 -1.0000000 -1.0000000 -0.5000000 -1.0000000 -0.5000000 -0.5000000
M -0.5000000 -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -1.0000000
R -0.5000000 -1.0000000 -0.5000000 -1.0000000 -0.7500000 -0.5000000 -0.7500000 -0.7500000 -1.0000000 -0.7500000
W -0.5000000 -1.0000000 -1.0000000 -0.5000000 -0.7500000 -0.7500000 -0.5000000 -1.0000000 -0.7500000 -0.7500000
S -1.0000000 -0.5000000 -0.5000000 -1.0000000 -0.7500000 -0.7500000 -1.0000000 -0.5000000 -0.7500000 -0.7500000
Y -1.0000000 -0.5000000 -1.0000000 -0.5000000 -0.7500000 -1.0000000 -0.7500000 -0.7500000 -0.5000000 -0.7500000
K -1.0000000 -1.0000000 -0.5000000 -0.5000000 -1.0000000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.5000000
V -0.6666667 -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.8333333
H -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.6666667 -0.8333333
D -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.8333333 -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.6666667
B -1.0000000 -0.6666667 -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.8333333 -0.6666667 -0.6666667 -0.6666667
N -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.7500000
           V          H          D          B     N
A -0.6666667 -0.6666667 -0.6666667 -1.0000000 -0.75
C -0.6666667 -0.6666667 -1.0000000 -0.6666667 -0.75
G -0.6666667 -1.0000000 -0.6666667 -0.6666667 -0.75
T -1.0000000 -0.6666667 -0.6666667 -0.6666667 -0.75
M -0.6666667 -0.6666667 -0.8333333 -0.8333333 -0.75
R -0.6666667 -0.8333333 -0.6666667 -0.8333333 -0.75
W -0.8333333 -0.6666667 -0.6666667 -0.8333333 -0.75
S -0.6666667 -0.8333333 -0.8333333 -0.6666667 -0.75
Y -0.8333333 -0.6666667 -0.8333333 -0.6666667 -0.75
K -0.8333333 -0.8333333 -0.6666667 -0.6666667 -0.75
V -0.6666667 -0.7777778 -0.7777778 -0.7777778 -0.75
H -0.7777778 -0.6666667 -0.7777778 -0.7777778 -0.75
D -0.7777778 -0.7777778 -0.6666667 -0.7777778 -0.75
B -0.7777778 -0.7777778 -0.7777778 -0.6666667 -0.75
N -0.7500000 -0.7500000 -0.7500000 -0.7500000 -0.75

As part of a discussion on Github, I provided the code to manually reassign elements of sub to create an asymmetric substitution matrix:

sub["N", "A"] <- 0
sub["N", "T"] <- 0
sub["N", "G"] <- 0
sub["N", "C"] <- 0
sub["M", "A"] <- 0
sub["M", "C"] <- 0
sub["R", "A"] <- 0
sub["R", "G"] <- 0
sub["W", "A"] <- 0
sub["W", "T"] <- 0
sub["S", "C"] <- 0
sub["S", "G"] <- 0
sub["Y", "C"] <- 0
sub["Y", "T"] <- 0
sub["K", "G"] <- 0
sub["K", "T"] <- 0
sub["V", "A"] <- 0
sub["V", "C"] <- 0
sub["V", "G"] <- 0
sub["H", "A"] <- 0
sub["H", "C"] <- 0
sub["H", "T"] <- 0
sub["D", "A"] <- 0
sub["D", "G"] <- 0
sub["D", "T"] <- 0
sub["B", "C"] <- 0
sub["B", "G"] <- 0
sub["B", "T"] <- 0

This covers all cases where the pattern (con) contains ambiguous bases and the subjects (seq) are unambiguous, changing the matrix so the subjects are not penalized. However, these edits do not account for cases where both the pattern and the subject have ambiguous encodings at the same position; e.g. ["M","V"] and ["V","M"] have the same value associated with them, even though V (A || C || G) is a superset of M (A || C). There are also more nuanced cases where the set of bases represented by a pair of ambiguous encodings overlap incompletely in both directions, as with H (A || C || T) and D (A || G || T). Given enough time, I could probably figure out the appropriate scores for each case to complete the asymmetric substitution matrix, but I'm surely reinventing the wheel. Can someone supply, as a standalone table or as part of a codebase, an asymmetric nucleotide substitution matrix that satisfies the criteria outlined herein?

Bioconductor tidyverse sequence-alignment R Biostrings • 687 views
ADD COMMENT
3
Entering edit mode
20 months ago
acvill ▴ 340

I've answered this question on Bioinformatics.SE.

https://bioinformatics.stackexchange.com/a/19612/3967

ADD COMMENT

Login before adding your answer.

Traffic: 1865 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6