Question: Protein FASTA AA index
12 weeks ago, Jacob jr wrote:

Hi Friends,

I am new to bioinformatics and to the machine learning that goes with it, and I am starting a project on protein classification using machine learning. What I have is two FASTA files for two classes of proteins. To do machine learning on them, I need to convert them into a .csv file of features, and I have no idea where to start. It would be a great help if anyone could show me how to load the AA indices from here: ftp://ftp.genome.jp/pub/db/community/aaindex/. I am attaching a screenshot of my FASTA file: https://ibb.co/CHNzvnH. Thanks in advance.


Maybe break the whole thing down into subtasks:

You have:

  • two fasta files of two classes of proteins.
  • a database that translates amino acids into numeric values according to their physicochemical properties

You want:

  • numeric vectors representing each protein sequence

Tasks:

  1. You already have the link to AAindex; now download the database files from there and have a look at them. How are the amino acids represented there?
modified 12 weeks ago • written 12 weeks ago by cschu1812.3k
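
The subtasks above (read the AAindex values, read the FASTA records, turn each sequence into numbers) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it expects the flat-file layout used by the `aaindex1` file (an `H` line with the accession, an `I` line followed by two rows of ten values in the order A R N D C Q E G H I / L K M F P S T W Y V, and `//` closing each entry), and it summarises each protein by the mean index value over its residues, which is just one simple choice of feature. The function names and the abridged sample entry are made up for illustration.

```python
import csv
import io

# Residue order of the two value rows that follow an "I" line in aaindex1.
AA_ORDER = "ARNDCQEGHI" + "LKMFPSTWYV"

# Abridged sample entry (Kyte-Doolittle hydropathy) so the sketch runs
# without downloading anything; real entries carry more annotation lines.
SAMPLE_ENTRY = """\
H KYTJ820101
D Hydropathy index (Kyte-Doolittle, 1982)
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
     1.8    -4.5    -3.5    -3.5     2.5    -3.5    -3.5    -0.4    -3.2     4.5
     3.8    -3.9     1.9     2.8    -1.6    -0.8    -0.7    -0.9    -1.3     4.2
//
"""


def parse_aaindex1(text):
    """Return {accession: {amino_acid: value or None}} from aaindex1-style text."""
    indices = {}
    accession = None
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith("H "):
            accession = line[2:].strip()
        elif line.startswith("I ") and accession is not None:
            # The two lines after the "I" header hold ten values each.
            raw = next(lines).split() + next(lines).split()
            indices[accession] = {
                aa: (None if value == "NA" else float(value))
                for aa, value in zip(AA_ORDER, raw)
            }
        elif line.startswith("//"):
            accession = None
    return indices


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def mean_index(sequence, values):
    """Average an index over a sequence, skipping residues without a value."""
    known = [values[aa] for aa in sequence if values.get(aa) is not None]
    return sum(known) / len(known) if known else float("nan")


def write_feature_csv(labelled_records, indices, out):
    """Write one CSV row per protein: id, class label, one column per index."""
    names = sorted(indices)
    writer = csv.writer(out)
    writer.writerow(["id", "label"] + names)
    for label, header, sequence in labelled_records:
        protein_id = (header.split() or [""])[0]
        features = [mean_index(sequence, indices[name]) for name in names]
        writer.writerow([protein_id, label] + features)
```

To use it on real data, one would read the downloaded `aaindex1` file into a string, pass it to `parse_aaindex1`, and feed the records from both FASTA files (each tagged with its class label) to `write_feature_csv` with an output file opened using `newline=""`.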
12 weeks ago, Mensur Dlakic (USA) wrote:

There are many ways to create protein features. Some of them are very fast but not necessarily very discriminative in the end. Since I have no idea how comfortable you are with command-line tools versus web servers, here is a quick list of both:

Separately, I recommend SPBuild as a very good feature generator that also happens to be fast.

All of these are in the very fast category. To generate protein features that give the best classification, you will most likely need to derive them from protein alignments that capture sequence conservation. This usually requires a lot of sequence searching, and it isn't fast because protein databases contain on the order of hundreds of millions of sequences. In a nutshell: 1) do iterative searching with a given sequence using tools such as BLAST or HHpred; 2) make a multiple alignment of the query and all the matches, and extract the frequencies of all amino acids for each alignment column. In the end it will look something like this:

KTFKLEIVTPEGVLFSGEVESVTVPGVEGELGILPGHAPLITALKPGELRIRDEDGKEEEFA
0.09506 0.03771 0.12736 0.05856 0.00225 0.03642 0.06066 0.05274 0.01786 0.00824 0.02066 0.17123 0.06920 0.00532 0.02785 0.13040 0.05437 0.00140 0.00747 0.01523
0.04374 0.03906 0.03994 0.02633 0.00543 0.03683 0.04880 0.02242 0.01231 0.02799 0.07818 0.11211 0.04138 0.01949 0.04011 0.12526 0.21628 0.00219 0.01595 0.04623
0.02107 0.00457 0.00278 0.00219 0.00501 0.00370 0.00491 0.00498 0.00271 0.14083 0.23245 0.00468 0.13822 0.26517 0.00470 0.00707 0.01949 0.00315 0.03312 0.09924
0.02973 0.07705 0.08937 0.05938 0.00309 0.09382 0.06901 0.01055 0.09896 0.02433 0.04525 0.15262 0.00993 0.01446 0.02082 0.06207 0.08597 0.00377 0.02388 0.02597
0.02101 0.00152 0.00112 0.00093 0.03052 0.00144 0.00160 0.00267 0.00102 0.11728 0.47205 0.00164 0.01367 0.09138 0.00127 0.00272 0.01134 0.00225 0.00817 0.21643

Or like this:

Last position-specific scoring matrix computed
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 S     1   0   0   0  -2   1   0   0  -1  -2  -2   0  -1  -2   0   3   1  -2  -1  -1
    2 T     0   0   0   0  -2   0   0  -1  -1  -1  -1   0  -1  -1   0   1   3  -1  -1   0
    3 Y    -1  -1  -1  -1  -2  -1  -1  -2   0   0   0  -2   0   2  -1   0   0   2   5   0
    4 H    -1   1   1   0  -2   1   0  -1   5  -2  -1   0  -1  -1   0   1   0   0   1  -2
    5 L     0  -1  -2  -2  -1  -1  -2  -2  -1   1   2  -2   1   1  -1  -1   0   1   1   1
    6 D    -1   0   1   4  -3   1   1  -1   0  -3  -2   0  -3  -3   1   1   0  -2  -1  -2
    7 V     0  -2  -2  -2  -1  -1  -2  -2  -2   2   1  -2   0   1   0  -1   0   0   0   2
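
For the second step, the per-column frequencies can be computed from an alignment along these lines. This is only a rough sketch, not what any particular search tool does internally: it assumes the alignment is already available as a list of equal-length strings, ignores gaps and non-standard letters, and offers an optional pseudocount that is my own addition. The function name is hypothetical, and the column order follows the header of the scoring matrix shown above.

```python
from collections import Counter

# Same residue order as the header row of the scoring matrix above.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_frequencies(alignment, pseudocount=0.0):
    """Return one list of 20 amino-acid frequencies per alignment column.

    Gap characters and non-standard letters are not counted, so each
    column is normalised over the standard residues observed in it.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(row[col].upper() for row in alignment)
        totals = [counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS]
        denom = sum(totals)
        # A column made only of gaps (and no pseudocount) gets all zeros.
        profile.append([t / denom if denom else 0.0 for t in totals])
    return profile
```

Each protein then yields a matrix with one row per alignment column and 20 values per row; how to reduce that to a fixed-length feature vector (for example by averaging the rows) is a separate design decision.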