Question: How can I reproduce features of SignalP, TMHMM and Phobius?
10 months ago by
dzisis1986 (20) wrote:

Is it possible to use a FASTA file of protein sequences to predict signal peptides and transmembrane (TM) helices without using SignalP, TMHMM or Phobius? How can I reproduce or calculate information such as signal peptide presence or the number of transmembrane helices? I would like to get results similar to the features Phobius_TM, Phobius_SP, SignalP_D and TMHMM_TM, but without using those programs. Do you think it is possible?

modified 10 months ago by Mensur Dlakic (5.8k) • written 10 months ago by dzisis1986 (20)

I think you are asking which programs, other than the ones you mentioned, can do the same thing?

written 10 months ago by Adrian Pelin (2.4k)

Would it be possible to elaborate a bit on this? Why would you want to do so? The tools you mention do a good job of predicting these kinds of features.

That being said, there are likely some (often species-dependent) alternatives, e.g. HECTAR, which specifically predicts signal peptides (and cell location) in brown algae.

written 10 months ago by lieven.sterck (7.9k)

I know that those programs predict these features well. I am just trying to extract some different features from different programs in the simplest way and then use them for further machine learning analysis. I want to use a FASTA file of protein sequences as input and get a result with more or less those features.

written 10 months ago by dzisis1986 (20)
10 months ago by
Mensur Dlakic (5.8k) wrote:

It is most definitely possible. I suggest you read the papers describing the programs and you will get an idea about the datasets they used, and how predictive models were developed. It is up to you whether you want to use the same data and modeling techniques, as pretty much any modern machine learning tool can be used for this task.

written 10 months ago by Mensur Dlakic (5.8k)

I read the papers describing the programs carefully and found, more or less, how some of the features are calculated, but so far I can't clearly understand how this is done in practice. I would like to calculate some basic features, such as the presence of a signal peptide, the existence of transmembrane helices, the number of predicted helices, or whether a protein is secretory, and then use them in my own machine learning. I was wondering whether it is possible to avoid installing a standard program and calculate those features with other, simpler methods that take the protein sequences as input, or whether this information is available in a database like UniProt.

modified 10 months ago • written 10 months ago by dzisis1986 (20)

Same answer as before: it is possible. Whether you can do it depends on your level of interest and your willingness to spend time. I will describe a simple scenario.

You take a set of sequences with signal peptides, and another without them. For each sequence do a BLAST search and build a multiple alignment that includes its homologs. From the alignment you can calculate the frequency of amino acids for each vertical position that corresponds to your original sequence. That will look something like this for a residue that is in helical conformation:

0.6816 0.3804 0.6412 0.7080 0.3068 0.4312 0.5708 0.4960 0.6972 0.4936 0.4320 0.6196 0.4232 0.6612 0.6496 0.6340 0.5820 0.5068 0.1860 0.4116

And something like this for a residue that isn't helical:

0.5660 0.4396 0.5880 0.5628 0.4876 0.7664 0.6288 0.4540 0.6780 0.4780 0.4672 0.6856 0.5928 0.5736 0.6056 0.5892 0.5276 0.4376 0.3460 0.4680

Do that for each residue in each of your proteins, assign them target values, and that would be your set of features for training. From that point on, it is simply a matter of applying your machine learning method of choice and verifying its accuracy by holdout, cross-validation, or both.
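To make the scenario above concrete, here is a minimal Python sketch of the per-position frequency step. It assumes you already have the multiple alignment as a list of equal-length strings with `-` for gaps; the names `column_frequencies`, `labelled_rows` and the toy alignment are made up for illustration. Note that it returns plain frequencies that sum to 1 per residue, whereas the example rows above do not sum to 1, so they were presumably scaled or transformed in some way not described here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in a fixed order


def column_frequencies(alignment, query_index=0):
    """Return one 20-value frequency vector per residue of the query.

    `alignment` is a list of equal-length aligned sequences ('-' = gap).
    Columns where the query itself has a gap are skipped, so the output
    has exactly one row per query residue.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    query = alignment[query_index]
    profile = []
    for col, query_res in enumerate(query):
        if query_res == "-":
            continue
        # Count only standard residues in this column; gaps are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile


def labelled_rows(profile, labels):
    """Pair each per-residue vector with its target value (e.g. 1 = in signal peptide)."""
    if len(profile) != len(labels):
        raise ValueError("need exactly one label per query residue")
    return list(zip(profile, labels))


# Toy alignment (hypothetical): the query first, then three made-up homologs.
toy_alignment = [
    "MKL-A",
    "MKLLA",
    "MRL-A",
    "-KLLG",
]
toy_profile = column_frequencies(toy_alignment)
toy_rows = labelled_rows(toy_profile, [1, 1, 0, 0])  # made-up target values
```

Each element of `toy_rows` is a (20 frequencies, target) pair, which is the shape most machine learning libraries expect once you stack the rows from all proteins.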

If you go to UniProt and enter transmembrane in its search field, it will find ~34 million proteins. Not sure if that's useful to you, but it can be done. Same for signal peptide: it will find ~12 million proteins. As for individual proteins, yes, they usually have annotation for signal peptide and transmembrane helices. I still think that for a reasonable number of proteins your best bet may be to paste their sequences into websites that predict these properties and wait for the predictions. With all respect, I doubt you will be able to create a better classifier without a major investment of time.
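If the existing UniProt annotation is good enough for your purposes, turning it into two numeric features is straightforward. The sketch below assumes you exported the signal peptide and transmembrane feature columns as text, and that each feature starts with the keyword `SIGNAL` or `TRANSMEM` followed by a position range; check that layout against your own export before relying on it. The function name and the example strings are made up for illustration.

```python
import re

# Assumed layout of an exported feature field, for example:
#   TRANSMEM 30..50; /note="Helical"; TRANSMEM 62..82; /note="Helical"
SIGNAL_RE = re.compile(r"\bSIGNAL\s")
TRANSMEM_RE = re.compile(r"\bTRANSMEM\s")


def annotation_features(signal_field, transmem_field):
    """Turn two annotation strings into simple numeric features."""
    return {
        "has_signal_peptide": int(bool(SIGNAL_RE.search(signal_field or ""))),
        "n_transmembrane": len(TRANSMEM_RE.findall(transmem_field or "")),
    }


# Made-up annotation strings for one protein.
example = annotation_features(
    'SIGNAL 1..22; /evidence="ECO:0000255"',
    'TRANSMEM 30..50; /note="Helical"; TRANSMEM 62..82; /note="Helical"',
)
unannotated = annotation_features("", "")
```

Keep in mind that much of this annotation is itself produced by prediction programs, so these values are not independent of tools like the ones discussed in this thread.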

written 10 months ago by Mensur Dlakic (5.8k)

Thank you for the detailed description and the example. I already checked the possibilities in UniProt, and I do believe that the existing programs give the best results. I am just trying to see whether and how it is possible to do something similar without them. I agree that re-creating all those classifiers would take a lot of time. My point is to use basic features in the simplest way and then use a more complicated machine learning method to combine all those results and extract the information I need! It would be good, for example, to be able to calculate features like Phobius_TM, TargetP_SP, TMHMM_AA and SignalP_D manually. I just don't get how to do it outside of the programs. Those features look to be the output of each program's machine learning algorithm rather than the result of some calculation based on the sequences, the positions, etc.

written 10 months ago by dzisis1986 (20)