Lookup Table error in Sequence Alignment with R
1
1
Entering edit mode
3.1 years ago
pattijane ▴ 10

Hello,

I'm using the following code to read from a text file containing more than 200 fasta files and to align these proteins.

seqs = read.fasta(file = "sequencefasta.txt", seqtype = "AA")
self_aligned_score = rep(NA, length(seqs))

for (i in 1:length(seqs)){
cat('aligning ',i,'/',length(seqs),' with itself..\n')
seq = paste(toupper(getSequence(seqs[i][[1]])), sep="", collapse="")
align_score = pairwiseAlignment(seq, seq, scoreOnly=TRUE, gapExtension=0.5, type="local", substitutionMatrix = "BLOSUM50")
self_aligned_score[i] = align_score
}


But I'm having the following error, which I have no idea about why.

Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : key 32 not in lookup table


Here is a snippet of my sequencefasta.txt file:

>sp|O00141|SGK1_HUMAN Serine/threonine-protein kinase Sgk1 OS=Homo sapiens GN=SGK1 PE=1 SV=2
MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNSYACKHPEVQSILKI
SQPQEPELMNANPSPPPSPSQQINLGPSSNPHAKPSDFHFLKVIGKGSFGKVLLARHKAE
GGELFYHLQRERCFLEPRARFYAAEIASALGYLHSLNIVYRDLKPENILLDSQGHIVLTD
FGLCKENIEHNSTTSTFCGTPEYLAPEVLHKQPYDRTVDWWCLGAVLYEMLYGLPPFYSR
NTAEMYDNILNKPLQLKPNITNSARHLLEGLLQKDRTKRLGAKDDFMEIKSHVFFSLINW
FSYAPPTDSFL

>sp|O00311|CDC7_HUMAN Cell division cycle 7-related protein kinase OS=Homo sapiens GN=CDC7 PE=1 SV=1
MEASLGIQMDEPMAFSPQRDRFQAEGSLKKNEQNFKLAGVKKDIEKLYEAVPQLSNVFKI
EDKIGEGTFSSVYLATAQLQVGPEEKIALKHLIPTSHPIRIAAELQCLTVAGGQDNVMGV
KYCFRKNDHVVIAMPYLEHESFLDILNSLSFQEVREYMLNLFKALKRIHQFGIVHRDVKP
SNFLYNRRLKKYALVDFGLAQGTHDTKIELLKFVQSEAQQERCSQNKSHIITGNKIPLSG
PVPKELDQQSTTKASVKRPYTNAQIQIKQGKDGKEGSVGLSVQRSVFGERNFNIHSSISH
ESPAVKLMKQSKTVDVLSRKLATKKKAISTKVMNSAVMRKTASSCPASLTCDCYATDKVC
SICLSRRQQVAPRAGTPGFRAPEVLTKCPNQTTAIDMWSAGVIFLSLLSGRYPFYKASDD
LTALAQIMTIRGSRETIQAAKTFGKSILCSKEVPAQDLRKLCERLRGMDSSTPKLTSDIQ
GHASHQPAISEKTDHKASCLVQTPPGQYSGNSFKKGDSNSCEHCFDEYNTNLEGWNEVPD
EAYDLLDKLLDLNPASRITAEEALLHPFFKDMSL

R sequence alignment fasta protein • 1.8k views
0
Entering edit mode

Which libraries are you using to import the fasta and to align the sequences?

0
Entering edit mode

I use these:

require("Biostrings")
require("seqinr")
require("Rcpp")

0
Entering edit mode

I think that the issue might be that you have some non-standard characters in your sequences. Can you check that you just have aminoacids, no "X"s and no other letters which do not correspond to an aa?

0
Entering edit mode

I checked for "X"s and there was none. Besides, wiki article for aa describes X as a valid letter to describe unknowns, https://www.wikiwand.com/en/Amino_acid#/Table_of_standard_amino_acid_abbreviations_and_properties

0
Entering edit mode

Yes it is standard, but maybe for some reason this library doesn't recognise it. Did you check for all characters which are not standard amino acids or just X?

0
Entering edit mode

I checked for X, U, O, B, Z, and J. There were none of them.

0
Entering edit mode
9 weeks ago
yehong39 • 0

Hi, I ran your code using your example .txt file, it worked fine, no errors. The error message you presented here indicated your sequence has extra space in it, Maybe there are extra spaces at the end or beginning of another sequence? You can try add one more line, after: seq = paste(toupper(getSequence(seqs[i][[1]])), sep="", collapse="") add: seq= gsub(" ","", seq) Then run the code again, see if that works. Good luck!