Question

Protein fasta AA index

4.0 years ago

Jacob jr • 0

Hi Friends,

I am new to bioinformatics and to the machine learning that goes with it, and I am starting a project on protein classification using machine learning. What I have is two fasta files, one for each of two classes of proteins. To do machine learning on them, I need to convert them into a .csv file of features, and I have no idea where to start. It would be a great help if anyone could show me how to load the AA indices from here: ftp://ftp.genome.jp/pub/db/community/aaindex/. I am attaching a photo of my fasta file here: https://ibb.co/CHNzvnH. Thanks in advance.

sequence Protein Machine learning csv fasta • 1.3k views

updated 4.0 years ago by Mensur Dlakic ★ 27k • written 4.0 years ago by Jacob jr • 0

Maybe break the whole thing down into subtasks:

You have:

two fasta files of two classes of proteins.
a database that translates amino acids into numeric values according to their physicochemical properties

You want:

numeric vectors representing each protein sequence

Tasks:

You already have the link to AAindex; now download the database files from there and have a look at them. How are the amino acids represented there?
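To make the subtasks above concrete, here is a minimal sketch of one way to go from fasta plus AAindex to a .csv. It assumes the `aaindex1` flat-file layout (an `H` accession line, then an `I` line followed by two rows of ten values in the order A R N D C Q E G H I / L K M F P S T W Y V), so check that against the file you actually download. The function names and the embedded sample entry are illustrative, and averaging each index over the whole sequence is just one simple way to get a fixed-length vector per protein.

```python
import csv
import io

# Amino acid order assumed for the two value rows of an aaindex1 entry.
AAINDEX_ORDER = "ARNDCQEGHI" + "LKMFPSTWYV"

# Small embedded sample in the assumed aaindex1 layout (Kyte-Doolittle hydropathy).
SAMPLE_AAINDEX = """\
H KYTJ820101
D Hydropathy index (Kyte-Doolittle, 1982)
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
     1.8    -4.5    -3.5    -3.5     2.5    -3.5    -3.5    -0.4    -3.2     4.5
     3.8    -3.9     1.9     2.8    -1.6    -0.8    -0.7    -0.9    -1.3     4.2
//
"""


def parse_aaindex1(text):
    """Return {accession: {amino_acid: value or None}} from aaindex1-style text."""
    indices = {}
    lines = text.splitlines()
    accession = None
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("H "):
            accession = line[2:].strip()
        elif line.startswith("I ") and accession and i + 2 < len(lines):
            raw_values = lines[i + 1].split() + lines[i + 2].split()
            indices[accession] = {
                aa: (None if raw == "NA" else float(raw))
                for aa, raw in zip(AAINDEX_ORDER, raw_values)
            }
            i += 2  # skip the two value rows just consumed
        i += 1
    return indices


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open fasta file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def mean_index(sequence, index):
    """Average one index over a sequence, skipping residues without a value."""
    values = [index[aa] for aa in sequence if index.get(aa) is not None]
    return sum(values) / len(values) if values else float("nan")


def write_feature_rows(fasta_handle, indices, label, writer):
    """Write one csv row per protein: id, class label, one mean value per index."""
    names = sorted(indices)
    for header, sequence in read_fasta(fasta_handle):
        row = [header, label] + [mean_index(sequence, indices[n]) for n in names]
        writer.writerow(row)


if __name__ == "__main__":
    indices = parse_aaindex1(SAMPLE_AAINDEX)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "label"] + sorted(indices))
    # Hypothetical toy sequences standing in for the two real fasta files.
    write_feature_rows(io.StringIO(">p1\nMKTAYIAK\n"), indices, "class_a", writer)
    write_feature_rows(io.StringIO(">p2\nGSSGSSG\n"), indices, "class_b", writer)
    print(out.getvalue())
```

For the real data you would replace the `io.StringIO` objects with `open("your_file.fasta")` for each class and pass the text of the downloaded `aaindex1` file to `parse_aaindex1`.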

4.0 years ago by cschu181 ★ 2.8k

score 0 · Answer 1 · 2020-04-19

There are many ways to create protein features. Some of them are very fast but not necessarily very discriminative in the end. Since I have no idea how comfortable you might be using command-line tools vs web servers, here is a quick list of both:

Separately, I recommend SPBuild as a very good feature generator that also happens to be fast.

All of these are in the very fast category. To generate protein features in a way that allows the best classification, one will most likely need to derive them from protein alignments that capture sequence conservation. This usually requires a lot of sequence searching, and it isn't fast because protein databases contain on the order of hundreds of millions of sequences. In a nutshell: 1) do iterative searching with a given sequence using tools such as BLAST or HHpred; 2) make a multiple alignment of the query and all the matches, and extract the frequencies of all amino acids for each alignment column. In the end it will look something like this:

KTFKLEIVTPEGVLFSGEVESVTVPGVEGELGILPGHAPLITALKPGELRIRDEDGKEEEFA
0.09506 0.03771 0.12736 0.05856 0.00225 0.03642 0.06066 0.05274 0.01786 0.00824 0.02066 0.17123 0.06920 0.00532 0.02785 0.13040 0.05437 0.00140 0.00747 0.01523
0.04374 0.03906 0.03994 0.02633 0.00543 0.03683 0.04880 0.02242 0.01231 0.02799 0.07818 0.11211 0.04138 0.01949 0.04011 0.12526 0.21628 0.00219 0.01595 0.04623
0.02107 0.00457 0.00278 0.00219 0.00501 0.00370 0.00491 0.00498 0.00271 0.14083 0.23245 0.00468 0.13822 0.26517 0.00470 0.00707 0.01949 0.00315 0.03312 0.09924
0.02973 0.07705 0.08937 0.05938 0.00309 0.09382 0.06901 0.01055 0.09896 0.02433 0.04525 0.15262 0.00993 0.01446 0.02082 0.06207 0.08597 0.00377 0.02388 0.02597
0.02101 0.00152 0.00112 0.00093 0.03052 0.00144 0.00160 0.00267 0.00102 0.11728 0.47205 0.00164 0.01367 0.09138 0.00127 0.00272 0.01134 0.00225 0.00817 0.21643

Or like this:

Last position-specific scoring matrix computed
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 S     1   0   0   0  -2   1   0   0  -1  -2  -2   0  -1  -2   0   3   1  -2  -1  -1
    2 T     0   0   0   0  -2   0   0  -1  -1  -1  -1   0  -1  -1   0   1   3  -1  -1   0
    3 Y    -1  -1  -1  -1  -2  -1  -1  -2   0   0   0  -2   0   2  -1   0   0   2   5   0
    4 H    -1   1   1   0  -2   1   0  -1   5  -2  -1   0  -1  -1   0   1   0   0   1  -2
    5 L     0  -1  -2  -2  -1  -1  -2  -2  -1   1   2  -2   1   1  -1  -1   0   1   1   1
    6 D    -1   0   1   4  -3   1   1  -1   0  -3  -2   0  -3  -3   1   1   0  -2  -1  -2
    7 V     0  -2  -2  -2  -1  -1  -2  -2  -2   2   1  -2   0   1   0  -1   0   0   0   2
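
The frequency-extraction part of step 2 can be sketched as follows. This is a simplified illustration rather than what any particular search tool does: it counts raw residue frequencies per alignment column, with no sequence weighting or pseudocounts, and it ignores gaps and non-standard characters. The column order `ARNDCQEGHILKMFPSTWYV` is borrowed from the scoring-matrix header above as an assumption, and the function name is made up for the example.

```python
from collections import Counter

# Column order assumed here; taken from the scoring-matrix header shown above.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_frequencies(alignment):
    """Return one list of 20 residue frequencies per alignment column.

    `alignment` is a list of equal-length aligned sequences; gap characters
    and anything outside the 20 standard amino acids are not counted.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in alignment if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


if __name__ == "__main__":
    # Hypothetical toy alignment: the query first, then two matches.
    toy_alignment = ["KTFK", "KSFR", "R-FK"]
    for row in column_frequencies(toy_alignment):
        print(" ".join(f"{value:.5f}" for value in row))
```

Each row of the result corresponds to one position of the query, so a protein of length N gives an N x 20 matrix that still has to be reduced to a fixed-length vector (for example by averaging or windowing) before it goes into a .csv for classification.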