Question

Amino Acid identity matrix help

0

Entering edit mode

23 months ago

Confused_human ▴ 20

Hello everyone,

I have a dataset of 4000 protein fasta sequences with me.

Example dataset:

>Acaryochloris_marina_peg_1945
MNILAVIPARYQSQRFPGKPLVMLDERPMVQWVYEAAKSCDFFQDAVVATDSDKIADCVKGFGGKVVMTRDDHLTGTDRVAEVAGFYDDMDVVVNVQGDQPFVTPEALEQLVRPYREGERPEMTTLGCPLDMDEDYASPNAVKVLCDRNGHALYFSRSPIPYFRTQGTVPVYHHLGLYAFRHDFLMQYSQLEPTPFETCEGLEQLRVLEYGYAIKVCQTQKAAIEINTPEDLVKAQLFIQQGMTS
>Acetivibrio_clariflavus_peg_3972
MRTGVIVQTRMGSSRLPGKVMIDICGKPVIEHVIDRLKMSKVLDDIIIATTTSVKDKIIVEQAKRNGVKWFCGSEEDVLSRYYYAARENRLSTVVRVTSDCPLIDPVILDEIVEFYKKNDYLLVTNAGNILEYRTYPRGLDVEVFSFDILEKAFYSAKKPYQREHVTPYIYETYENKIYYYKNNINLSKYRWTLDTEEDLKLISIIFNNFYYKYGRNFGFKDILKFIQSNPQLSKINEHIEQKKIG
>Acetivibrio_clariflavus_peg_4060
MKILFTICGRKGSKGVKSKNIKTFLGFPLAFYTASFIDLFIKRNNWVDSDIVLNTDSENLIDLFKNKLNMPIEIIERDPELAKDYVPKISVIKNCYDVMVERKKVSYDIVIDLDITSPLRRLRDLQSLIEKKLNSNADVVFSVTSARRNPYFNMVKKGENGYERVIESSFNARQEAPNVFDINGSMYAYSPDFLKSGKGLFDGICDIIEMRDTAVLDIDHENDFELMEVIAKYLYSSDNEYNCIRENINNILLKD
>Acetivibrio_saccincola_peg_0278
MKKVVAIIQARMGSTRLPGKVMKNLCGKTVLAHDIERVRQSKYIDEIVIATTKFKEDDIILREALENGAKVYRGSEDDVLRRYYEAAKENKADVIVRITSDCPLIDPFIVDEVIKVYLNSNYDLVTNAGIYPENRTYPRGLDVEVFSFDILKKAFQEAKEMYQREHVTPFIYENSKNIYYYKNDIDYSKYRWTLDTEEDYKLIEIIYKSFIKESIIFILMIF*SCLIICQSYLKLIKMCNKKLSVD
>Acetivibrio_saccincola_peg_0286
MKVSAIIQARTGSSRLPGKVLKEICGLPVLVHVINRVKQAKKVNEIIVATTDKASDEVIVDISEMENIKVFRGSEEDVLERYYKTALHFKSDIIVRITSDNPLTDHRLIDKIVENLIIHNADYSCNNMPSTYPYGLDCECFTFQVLEEAFFNAKDKYEREHVTPYIRENKELFKIVSIKGNDNYSHLRWTLDTQEDYNHIKEIFENLYHKNKYFLTEDIIQFLQENKRI

I want to make a amino acid identity scoring matrix of each protein with each protein. And I want to generate a matrix table for the result. So it means my X-axis is my protein set and Y-axis is the same protein set. It means each protein similarity score will be given with each protein.

I have tried few of the online available tools, AAI profiler and comparem, I couldn't really get exactly what I am looking for. Is there any way that I can do it with the blast search?

Thank you

Linux shell-scripting • 1.6k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 23 months ago by Confused_human ▴ 20

0

Entering edit mode

You should just script this yourself... R or python would be easier than bash

ADD REPLY • link 23 months ago by benformatics 3.9k

0

Entering edit mode

I would advise strongly against it, one could easily cook up some script using blast, doing something but that is most likely going to be detrimental (meaning you are worse off with such a solution than none at all, e.g. because the use of Blast for this task is the wrong approach). Instead of just starting to hack something it is recommended:

Analyze and describe the problem properly
define the question you are trying to answer and what you want to do with your result
check for existing and established solutions to your problem first
if it is a common problem (like phylogenetic analysis) it is very likely a solution already exists.

ADD REPLY • link 23 months ago by Michael 54k

1

Entering edit mode

I think you maybe misread bash as blast but based on OP's question... It would take under 20 lines to write your own function if you had to write your own protein distance function. Assuming you don't want something super complicated. Alternatively, if you use an existing package you can directly generate the 'identity scoring matrix' that OP wants in essentially a single line:

library(DECIPHER)
the.proteins <- AAStringSet(c('AA','PYAF','FFF'))
dist.mat <- DistanceMatrix(the.proteins,correction='Jukes-Cantor')

ADD REPLY • link 23 months ago by benformatics 3.9k

0

Entering edit mode

That looks like a proper existing solution to me, possibly I misunderstood your intent. I wouldn't even call that scripting.

ADD REPLY • link 23 months ago by Michael 54k

0

Entering edit mode

Btw, possibly we shouldn't use Jukes-Cantor for protein sequences? But rather JTT, LG, or WAG?

ADD REPLY • link 23 months ago by Michael 54k

0

Entering edit mode

Hi , thank you for your suggestion. as far I came to know DistanceMatrix will tell me the dissimilarity score between two sequences. I want to know the similarity score .

Is there any parameter for that ?

ADD REPLY • link 23 months ago by Confused_human ▴ 20

0

Entering edit mode

amino acid identity scoring matrix or amino acid similarity scoring matrix ?

ADD REPLY • link 23 months ago by cpad0112 21k

0

Entering edit mode

Hi, I want to know the AAI(Amino Acid identity) score between the sequences.

ADD REPLY • link 23 months ago by Confused_human ▴ 20

0

Entering edit mode

23 months ago

Michael 54k

This is possible, but you should use global alignment or MSA tools for this, Blast is the wrong tool for the job because it does local alignments. Doing MSA is possibly the most convenient approach because you will need that for phylogenetics anyway using e.g. the mafft software should allow to align your sequences. You can do the whole process in MEGA10 as well and it can also calculate a distance matrix directly, you have to check if that works with the number of sequences. There is also protdist in the EMBOSS package, but I don't know how this fares with 4000 sequences. If you want to go for global alignment you could try ggsearch or needle in EMBOSS (Needleman-Wunsch). It all depends on what you want to with the result. If you get a distance matrix, and want a similarity matrix instead it can be converted as well.

Note that the calculated distances depend on the chosen AA substitution model. Also, distance matrices are of limited value when it comes to determining evolutionary relationships.

ADD COMMENT • link 23 months ago by Michael 54k

0

Entering edit mode

23 months ago

benformatics 3.9k

Should be pretty straight-forward in R to do a basic comparison - would be good to read the function manual if you want to alter the other parameters (e.g. gap penalties):

##BiocManager::install("DECIPHER")
library(DECIPHER)
your.proteins <- AAStringSet(c('AA','PYAF','FFF'))
dist.mat <- DistanceMatrix(your.proteins,correction='Jukes-Cantor')

> head(dist.mat)
         [,1]     [,2]     [,3]
[1,]  0.00000 34.89996 34.89996
[2,] 34.89996  0.00000 34.89996
[3,] 34.89996 34.89996  0.00000

ADD COMMENT • link 23 months ago by benformatics 3.9k

0

Entering edit mode

Hi , thank you for your suggestion. as far I came to know DistanceMatrix will tell me the dissimilarity score between two sequences. I want to know the similarity score .

Is there any parameter for that ?

ADD REPLY • link 23 months ago by Confused_human ▴ 20

score 2 · Accepted Answer · 2022-05-19

One way to describe this is by calculating amino-acid identity. There is a simple way to do it with BioPython and lastp:

https://github.com/faylward/lastp_aai

It can be calculated indirectly from a distance matrix, as long as the matrix scale is known (they are typically 0-1 or 0-100). Once the distance matrix is calculated, the similarity matrix is calculated for each entry by subtracting its values from the maximum score.

For example, Clustal Omega calculates a distance matrix on a 0-1 scale:

clustalo --infile=myfile.fas --distmat-out=myfile.matrix -o myfile.aln --outfmt=clu --full --threads=10

The distance matrix will look like this:

10
sequence_1  0.000000 0.890000 0.880000 0.890000 0.910000 0.900000 0.920000 0.930000 0.880000 0.880000
sequence_2  0.890000 0.000000 0.900000 0.900000 0.900000 0.900000 0.890000 0.880000 0.900000 0.900000
sequence_3  0.880000 0.900000 0.000000 0.890000 0.900000 0.910000 0.890000 0.910000 0.910000 0.900000
sequence_4  0.890000 0.900000 0.890000 0.000000 0.890000 0.900000 0.910000 0.890000 0.920000 0.890000
sequence_5  0.910000 0.900000 0.900000 0.890000 0.000000 0.900000 0.900000 0.890000 0.910000 0.910000
sequence_6  0.900000 0.900000 0.910000 0.900000 0.900000 0.000000 0.910000 0.910000 0.890000 0.890000
sequence_7  0.920000 0.890000 0.890000 0.910000 0.900000 0.910000 0.000000 0.880000 0.900000 0.900000
sequence_8  0.930000 0.880000 0.910000 0.890000 0.890000 0.910000 0.880000 0.000000 0.910000 0.890000
sequence_9  0.880000 0.900000 0.910000 0.920000 0.910000 0.890000 0.900000 0.910000 0.000000 0.920000
sequence_10 0.880000 0.900000 0.900000 0.890000 0.910000 0.890000 0.900000 0.890000 0.920000 0.000000

Now, doing (1-entry) for each of the matrix numbers will give a similarity matrix.