How to perform alignment of nucleotide sequence containing ambiguous bases
3
0
Entering edit mode
2.1 years ago
mschmidt ▴ 80

I have sequences obtained by Sanger sequencing on metagenomic DNA that contain ambiguous bases (due to amplification of two or more highly homologous sequences). I want to look up for best possible matches from database. BLAST finds matches but treats ambiguous bases as mismatch. I plan to take the sequences found by BLAST (mySeq -> BLASTn -> all.BLAST.seqs) and align them back to my sequence to find the best possible match/es taking into account the meaning of ambiguous positions. I think it can be performed by multiple alignment in R with the DECIPHER package to set of all possible sequences generated by Disambiguate.

It would look something like:

library(DECIPHER)
permutations <- Disambiguate(mySeq) #expands ambiguities into all permutations - quite a lot!

all.BLAST.seqs <- OrientNucleotides(all.BLAST.seqs) #to put the seqs found by Blast in the same orientation

aligned <- AlignSeqs(c(all.BLAST.seqs, permutations))

d <- DistanceMatrix(aligned) # calculate a maximum likelihood tree

dend <- IdClusters(d, type="dendrogram", myXStringSet=aligned)

dend <- dendrapply(dend) #to create phylogenetic tree to look for the closest matches to mySeq


However this I believe might not be the best way. Especially as disambiguate may produce quite a lot of sequences.

Does anybody knows better, faster way to find best matching sequences taking into account information from ambiguous positions?

R alignment sequence nucleotide • 2.2k views
0
Entering edit mode

Like blastn, other software tend to treat ambiguities as mismatches which I think is the conservative way of doing it. Also have a look at exonerate, I don't remember exactly how it treats ambiguity codes but as a last resort you could pass it a custom substitution matrix that includes ambiguity codes (I think the matrices are hard-coded in blast). The disambiguate approach could be viable if you parallelize it.
See some more discussion in this previous post.

0
Entering edit mode

Hi! I would not use Disambiguate if understand the output of this function correctly. With this approach you may generate a huge number of artificial sequences which does not exist even in nature not only in your samples. E.g. having a sequence ARTY does not mean that there are mandatory all of four haplotypes AATT, AATC, AGTT and AGTC in your samples. Moreover, a presence of a such ambiguity bases in sanger sequencing reads may be a sequencing artefacts instead of a true sequence variability. If there are a few ambiguity bases in your data i guess it would not influence much on the results of your study.

0
Entering edit mode

Hi! I have expected to obtain sequences with ambiguous positions. I have sequenced amplicons obtained on metagenomic DNA from environmental samples. The PCR was with universal primers targeting conserved regions within phage genome of specific type to amplify sequences from various virus species. The sequences, judging on Sanger chromatogram reads, are fine. There are stretches of sequence with single high peaks, and several positions with two peaks of close height. The disambiguate produces hundreds to even thousands possible sequences. This makes my approach quite troublesome.

3
Entering edit mode
2.1 years ago
mschmidt ▴ 80

I have found solution finally. For those with similar problem:

library(Biostrings)
#generate nucleotide substitution matrix for all 15 base letters needs baseOnly = FALSE
nsm <- nucleotideSubstitutionMatrix(match = 5, mismatch = -4, baseOnly = FALSE, type = "DNA")
s2 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
pairwiseAlignment(s1, s2, substitutionMatrix = nsm, gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 1)
subject: ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA-GGAAACGCAAAGTTTTCAAG
score: 231.5


However, in case anyone can add a comment I appreciate very much any response. Best regards, Marcin

0
Entering edit mode

Hi! I'm wondering how substitution from W to R will be treated in nsm matrix? Will it be a mismatch?

1
Entering edit mode
> print(nsm)
A     C     G     T     M     R     W     S     Y     K     V     H     D
A  5.00 -4.00 -4.00 -4.00  0.50  0.50  0.50 -4.00 -4.00 -4.00 -1.00 -1.00 -1.00
C -4.00  5.00 -4.00 -4.00  0.50 -4.00 -4.00  0.50  0.50 -4.00 -1.00 -1.00 -4.00
G -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50 -4.00  0.50 -1.00 -4.00 -1.00
T -4.00 -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50  0.50 -4.00 -1.00 -1.00
M  0.50  0.50 -4.00 -4.00  0.50 -1.75 -1.75 -1.75 -1.75 -4.00 -1.00 -1.00 -2.50
R  0.50 -4.00  0.50 -4.00 -1.75  0.50 -1.75 -1.75 -4.00 -1.75 -1.00 -2.50 -1.00
W  0.50 -4.00 -4.00  0.50 -1.75 -1.75  0.50 -4.00 -1.75 -1.75 -2.50 -1.00 -1.00
S -4.00  0.50  0.50 -4.00 -1.75 -1.75 -4.00  0.50 -1.75 -1.75 -1.00 -2.50 -2.50
Y -4.00  0.50 -4.00  0.50 -1.75 -4.00 -1.75 -1.75  0.50 -1.75 -2.50 -1.00 -2.50
K -4.00 -4.00  0.50  0.50 -4.00 -1.75 -1.75 -1.75 -1.75  0.50 -2.50 -2.50 -1.00
V -1.00 -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00
H -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -1.00 -2.50 -2.00 -1.00 -2.00
D -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00 -1.00
B -4.00 -1.00 -1.00 -1.00 -2.50 -2.50 -2.50 -1.00 -1.00 -1.00 -2.00 -2.00 -2.00
N -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75
B     N
A -4.00 -1.75
C -1.00 -1.75
G -1.00 -1.75
T -1.00 -1.75
M -2.50 -1.75
R -2.50 -1.75
W -2.50 -1.75
S -1.00 -1.75
Y -1.00 -1.75
K -1.00 -1.75
V -2.00 -1.75
H -2.00 -1.75
D -2.00 -1.75
B -1.00 -1.75
N -1.75 -1.75

2
Entering edit mode
2.1 years ago
mschmidt ▴ 80

For for those who prefer DECIPHER it is also possible using custom nucleotide substitution matrix nsm

st = c("ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTCGTATGC",
"ACGTGTTCGTAGCTGATCGGCGGATGGCACTGATCGTTTGCTAGTCGTAGTTCGTATGC",
"ACGYGTACGTAGCTGATCGSGGGATGGCCCTGVTCGTATGCTKGTCGTAGTTCGTATGC",
"ACGTGTACGTTGCTGATCGGGGGATGGCCCTGATCGTTTGCTAGTCGTAGTTTCGTATGC",
"ACGKGTACGTWGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTTCGTATGC",
"ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTGTAGTTCGTATC")
st = DNAStringSet(st)
aligned <- AlignSeqs(st,
guideTree = NULL,
iterations = 1,
refinements = 1,
gapOpening = -1,
gapExtension = -2,
useStructures = FALSE,
levels = c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2),
processors = 1,
verbose = TRUE,
substitutionMatrix = nsm)
BrowseSeqs(aligned, highlight=0)


gives:

                           20                  40                  60
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|
1    ···T··A···A········GG·······C···A····A····A··C····-·······G·    59
2    ···T··T···A········GC·······A···A····T····A··C····-·······G·    59
3    ···Y··A···A·········G·······C········A····K··C····-·······G·    59
4    ···T··A···T········GG·······C···A····T····A··C····T·······G·    60
5    ···K··A············GG·······C···A····A····A··C····T·······G·    60
6    ···T··A···A········GG·······C···A····A····A··-····-·······-·    57
ConsensusACGBGTWCGTWGCTGATCGSSGGATGGCMCTGVTCGTWTGCTDGT+GTAG+TTCGTAT+C    60

1
Entering edit mode
21 months ago
Biodeer ▴ 40

Hi, a tool named mesquite may help you manually edit the ambiguous bases with alignment sequences. http://mesquiteproject.org/mesquiteArchives/mesquite2.75/Mesquite_Folder/docs/mesquite/Diversification/diversification.html