Question: How to perform alignment of nucleotide sequence containing ambiguous bases
0
gravatar for mschmidt
8 days ago by
mschmidt50
mschmidt50 wrote:

I have sequences obtained by Sanger sequencing on metagenomic DNA that contain ambiguous bases (due to amplification of two or more highly homologous sequences). I want to look up for best possible matches from database. BLAST finds matches but treats ambiguous bases as mismatch. I plan to take the sequences found by BLAST (mySeq -> BLASTn -> all.BLAST.seqs) and align them back to my sequence to find the best possible match/es taking into account the meaning of ambiguous positions. I think it can be performed by multiple alignment in R with the DECIPHER package to set of all possible sequences generated by Disambiguate.

It would look something like:

library(DECIPHER) 
permutations <- Disambiguate(mySeq) #expands ambiguities into all permutations - quite a lot! 

all.BLAST.seqs <- OrientNucleotides(all.BLAST.seqs) #to put the seqs found by Blast in the same orientation 

aligned <- AlignSeqs(c(all.BLAST.seqs, permutations)) 

d <- DistanceMatrix(aligned) # calculate a maximum likelihood tree 

dend <- IdClusters(d, type="dendrogram", myXStringSet=aligned) 

dend <- dendrapply(dend) #to create phylogenetic tree to look for the closest matches to mySeq

However this I believe might not be the best way. Especially as disambiguate may produce quite a lot of sequences.

Does anybody knows better, faster way to find best matching sequences taking into account information from ambiguous positions?

alignment sequence R nucleotide • 122 views
ADD COMMENTlink modified 4 days ago • written 8 days ago by mschmidt50

Like blastn, other software tend to treat ambiguities as mismatches which I think is the conservative way of doing it. Also have a look at exonerate, I don't remember exactly how it treats ambiguity codes but as a last resort you could pass it a custom substitution matrix that includes ambiguity codes (I think the matrices are hard-coded in blast). The disambiguate approach could be viable if you parallelize it.
See some more discussion in this previous post.

ADD REPLYlink written 8 days ago by Jean-Karim Heriche22k

Hi! I would not use Disambiguate if understand the output of this function correctly. With this approach you may generate a huge number of artificial sequences which does not exist even in nature not only in your samples. E.g. having a sequence ARTY does not mean that there are mandatory all of four haplotypes AATT, AATC, AGTT and AGTC in your samples. Moreover, a presence of a such ambiguity bases in sanger sequencing reads may be a sequencing artefacts instead of a true sequence variability. If there are a few ambiguity bases in your data i guess it would not influence much on the results of your study.

ADD REPLYlink modified 8 days ago • written 8 days ago by Denis190

Hi! I have expected to obtain sequences with ambiguous positions. I have sequenced amplicons obtained on metagenomic DNA from environmental samples. The PCR was with universal primers targeting conserved regions within phage genome of specific type to amplify sequences from various virus species. The sequences, judging on Sanger chromatogram reads, are fine. There are stretches of sequence with single high peaks, and several positions with two peaks of close height. The disambiguate produces hundreds to even thousands possible sequences. This makes my approach quite troublesome.

ADD REPLYlink written 7 days ago by mschmidt50
2
gravatar for mschmidt
7 days ago by
mschmidt50
mschmidt50 wrote:

I have found solution finally. For those with similar problem:

library(Biostrings)
#generate nucleotide substitution matrix for all 15 base letters needs baseOnly = FALSE
nsm <- nucleotideSubstitutionMatrix(match = 5, mismatch = -4, baseOnly = FALSE, type = "DNA")
s1 <- DNAString("ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAASGGAADCGCAAAGYTTCAAG")
s2 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
pairwiseAlignment(s1, s2, substitutionMatrix = nsm, gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: ACTTCRCCAGCTCCYTGGCGGTHAGTTGATCAA-SGGAADCGCAAAGYTT-CAAG
subject: ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAA-GGAAACGCAAAGTTTTCAAG
score: 231.5

However, in case anyone can add a comment I appreciate very much any response. Best regards, Marcin

ADD COMMENTlink written 7 days ago by mschmidt50

Hi! I'm wondering how substitution from W to R will be treated in nsm matrix? Will it be a mismatch?

ADD REPLYlink written 7 days ago by Denis190
1
> print(nsm)
      A     C     G     T     M     R     W     S     Y     K     V     H     D
A  5.00 -4.00 -4.00 -4.00  0.50  0.50  0.50 -4.00 -4.00 -4.00 -1.00 -1.00 -1.00
C -4.00  5.00 -4.00 -4.00  0.50 -4.00 -4.00  0.50  0.50 -4.00 -1.00 -1.00 -4.00
G -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50 -4.00  0.50 -1.00 -4.00 -1.00
T -4.00 -4.00 -4.00  5.00 -4.00 -4.00  0.50 -4.00  0.50  0.50 -4.00 -1.00 -1.00
M  0.50  0.50 -4.00 -4.00  0.50 -1.75 -1.75 -1.75 -1.75 -4.00 -1.00 -1.00 -2.50
R  0.50 -4.00  0.50 -4.00 -1.75  0.50 -1.75 -1.75 -4.00 -1.75 -1.00 -2.50 -1.00
W  0.50 -4.00 -4.00  0.50 -1.75 -1.75  0.50 -4.00 -1.75 -1.75 -2.50 -1.00 -1.00
S -4.00  0.50  0.50 -4.00 -1.75 -1.75 -4.00  0.50 -1.75 -1.75 -1.00 -2.50 -2.50
Y -4.00  0.50 -4.00  0.50 -1.75 -4.00 -1.75 -1.75  0.50 -1.75 -2.50 -1.00 -2.50
K -4.00 -4.00  0.50  0.50 -4.00 -1.75 -1.75 -1.75 -1.75  0.50 -2.50 -2.50 -1.00
V -1.00 -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00
H -1.00 -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -2.50 -1.00 -2.50 -2.00 -1.00 -2.00
D -1.00 -4.00 -1.00 -1.00 -2.50 -1.00 -1.00 -2.50 -2.50 -1.00 -2.00 -2.00 -1.00
B -4.00 -1.00 -1.00 -1.00 -2.50 -2.50 -2.50 -1.00 -1.00 -1.00 -2.00 -2.00 -2.00
N -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75 -1.75
      B     N
A -4.00 -1.75
C -1.00 -1.75
G -1.00 -1.75
T -1.00 -1.75
M -2.50 -1.75
R -2.50 -1.75
W -2.50 -1.75
S -1.00 -1.75
Y -1.00 -1.75
K -1.00 -1.75
V -2.00 -1.75
H -2.00 -1.75
D -2.00 -1.75
B -1.00 -1.75
N -1.75 -1.75
ADD REPLYlink written 5 days ago by mschmidt50
1
gravatar for mschmidt
4 days ago by
mschmidt50
mschmidt50 wrote:

For for those who prefer DECIPHER it is also possible using custom nucleotide substitution matrix nsm

st = c("ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTCGTATGC",
       "ACGTGTTCGTAGCTGATCGGCGGATGGCACTGATCGTTTGCTAGTCGTAGTTCGTATGC",
       "ACGYGTACGTAGCTGATCGSGGGATGGCCCTGVTCGTATGCTKGTCGTAGTTCGTATGC",
       "ACGTGTACGTTGCTGATCGGGGGATGGCCCTGATCGTTTGCTAGTCGTAGTTTCGTATGC",
       "ACGKGTACGTWGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTCGTAGTTTCGTATGC",
       "ACGTGTACGTAGCTGATCGGGGGATGGCCCTGATCGTATGCTAGTGTAGTTCGTATC")
st = DNAStringSet(st)
aligned <- AlignSeqs(st,
                         guideTree = NULL,
                         iterations = 1,
                         refinements = 1,
                         gapOpening = -1,
                         gapExtension = -2,
                         useStructures = FALSE,
                         FUN = AdjustAlignment,
                         levels = c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2),
                         processors = 1,
                         verbose = TRUE,
                         substitutionMatrix = nsm)
BrowseSeqs(aligned, highlight=0)

gives:

                           20                  40                  60        
         '''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|        
    1    ···T··A···A········GG·······C···A····A····A··C····-·······G·    59    
    2    ···T··T···A········GC·······A···A····T····A··C····-·······G·    59    
    3    ···Y··A···A·········G·······C········A····K··C····-·······G·    59    
    4    ···T··A···T········GG·······C···A····T····A··C····T·······G·    60    
    5    ···K··A············GG·······C···A····A····A··C····T·······G·    60    
    6    ···T··A···A········GG·······C···A····A····A··-····-·······-·    57    
ConsensusACGBGTWCGTWGCTGATCGSSGGATGGCMCTGVTCGTWTGCTDGT+GTAG+TTCGTAT+C    60
ADD COMMENTlink written 4 days ago by mschmidt50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 835 users visited in the last hour