How can I reproduce features of SignalP, TMHMM and Phobius?
1
0
2.3 years ago
dzisis1986 ▴ 30

Is it possible to use a FASTA file of protein sequences to predict signal peptides and transmembrane (TM) helices without using SignalP, TMHMM, or Phobius? How can I reproduce or calculate information such as signal peptide presence or the number of transmembrane helices? I would like to get results similar to the features Phobius_TM, Phobius_SP, SignalP_D, and TMHMM_TM, but without using those programs. Do you think it is possible?

signal peptide TM transmembrane • 636 views
0

I think what you are asking is: which other programs, aside from the ones you mentioned, can do the same thing?

0

Would it be possible to elaborate a bit on this? Why would you want to do so? The tools you mention do a good job of predicting these kinds of features.

That being said, there are likely some (often species-dependent) alternatives, e.g. HECTAR, which specifically predicts signal peptides (and cellular location) in brown algae.

0

I know that those programs predict those features well. I am just trying to extract different features from different programs in the simplest way and then use them for further machine learning analysis. I want to use a FASTA file of protein sequences as input and get a result containing more or less those features.

0
2.3 years ago
Mensur Dlakic ★ 15k

It is most definitely possible. I suggest you read the papers describing the programs and you will get an idea about the datasets they used, and how predictive models were developed. It is up to you whether you want to use the same data and modeling techniques, as pretty much any modern machine learning tool can be used for this task.

0

I read the papers describing the programs carefully and found more or less how some of the features are calculated, but I still can't clearly understand how this is done in practice. I would like to calculate some basic features, such as the presence of a signal peptide, the existence of transmembrane helices, the number of predicted helices, or whether a protein is secretory, and then use them in my own machine learning. I was wondering whether it is possible to avoid installing a standard program and calculate those features with other, simpler methods that take the protein sequences as input, or whether this information is available in a database like UniProt.

1

Same answer as before - it is possible. Whether you can do it depends on your level of interest and your willingness to spend time. I will describe a simple scenario.

You take a set of sequences with signal peptides, and another without them. For each sequence do a BLAST search and build a multiple alignment that includes its homologs. From the alignment you can calculate the frequency of amino acids for each vertical position that corresponds to your original sequence. That will look something like this for a residue that is in helical conformation:

0.6816 0.3804 0.6412 0.7080 0.3068 0.4312 0.5708 0.4960 0.6972 0.4936 0.4320 0.6196 0.4232 0.6612 0.6496 0.6340 0.5820 0.5068 0.1860 0.4116


And something like this for a residue that isn't helical:

0.5660 0.4396 0.5880 0.5628 0.4876 0.7664 0.6288 0.4540 0.6780 0.4780 0.4672 0.6856 0.5928 0.5736 0.6056 0.5892 0.5276 0.4376 0.3460 0.4680
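As a rough illustration of the per-position frequency calculation described above, here is a minimal Python sketch. It assumes the alignment is already available as equal-length gapped strings with the original sequence first; the function name `column_frequencies` and the toy alignment are made up for illustration. It returns plain frequencies that sum to 1 per position, so the numbers will not be on the same scale as the example rows above, which may come from a differently scaled profile.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, query_index=0):
    """Return one 20-value frequency vector per residue of the query sequence.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Columns where the query itself has a gap are skipped, so the result has
    exactly one row per query residue.
    """
    query = alignment[query_index]
    if any(len(seq) != len(query) for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col, query_residue in enumerate(query):
        if query_residue == "-":
            continue  # insertion relative to the query; no query residue here
        # Count only the 20 standard amino acids; gaps and other symbols are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile


# Toy alignment (hypothetical sequences, query first).
alignment = [
    "MKT-LLV",
    "MKTALLV",
    "MRS-LIV",
    "MKT-ALV",
]
profile = column_frequencies(alignment)
print(len(profile), "residues x", len(profile[0]), "features")
```

In practice the alignment would come from the BLAST hits for each sequence, and real profiles are usually smoothed with pseudocounts or sequence weighting, which this sketch leaves out.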


Do that for each residue in each of your proteins, assign them target values, and that would be your set of features for training. From that point on, it is simply a matter of applying your machine learning method of choice and verifying its accuracy by holdout, cross-validation, or both.
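To make the cross-validation step concrete, here is a small pure-Python sketch. The nearest-centroid classifier is only a stand-in for "your machine learning method of choice", and all function names and the toy data are hypothetical; any real library classifier could be dropped into `train` and `predict`.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(features, labels):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    model = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model, row):
    """Return the label whose centroid is closest to `row`."""
    return min(model, key=lambda label: squared_distance(model[label], row))


def cross_validate(features, labels, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Toy data: two well-separated classes with two features each.
features = [[0.01 * i, 0.1] for i in range(10)] + [[1.0 + 0.01 * i, 0.9] for i in range(10)]
labels = [0] * 10 + [1] * 10
mean_accuracy = cross_validate(features, labels, k=5)
print("mean accuracy:", mean_accuracy)
```

One caveat for per-residue features: residues from the same protein (or from close homologs) are strongly correlated, so it is safer to assign whole proteins to folds rather than individual residues. The sketch above splits by sample and would need that change for real data.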

If you go to UniProt and enter transmembrane in its search field, it will find ~34 million proteins. Not sure if that's useful to you, but it can be done. Same for signal peptide - it will find ~12 million proteins. As for individual proteins, yes, they usually have annotations for signal peptide and transmembrane helices. I still think that for a reasonable number of proteins your best bet may be to paste their sequences into websites that predict these properties and wait for the predictions. With all respect, I doubt you will be able to create a better classifier without a major investment of time.
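If you want to pull the existing annotations programmatically rather than through the search field, something along these lines might work. This is a sketch under assumptions: the REST endpoint, the return-field names `ft_signal` and `ft_transmem`, and the layout of the returned annotation strings should all be checked against the current UniProt API documentation, and the helper names and example strings below are made up for illustration.

```python
import re
from urllib.parse import urlencode

# Assumption: UniProt REST search endpoint and return-field names;
# verify both against the current UniProt API documentation.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(accession):
    """Build a TSV query URL asking for signal peptide and TM annotations."""
    params = {
        "query": f"accession:{accession}",
        "fields": "accession,ft_signal,ft_transmem",
        "format": "tsv",
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


def summarize_features(signal_field, transmem_field):
    """Reduce two raw annotation strings to simple features.

    Assumption: each annotated feature appears as a 'SIGNAL ...' or
    'TRANSMEM ...' entry in the corresponding column.
    """
    return {
        "has_signal_peptide": bool(re.search(r"\bSIGNAL\b", signal_field)),
        "n_tm_helices": len(re.findall(r"\bTRANSMEM\b", transmem_field)),
    }


url = build_query_url("P05067")  # example accession; the URL is built, not fetched
print(url)

# Illustrative strings in the assumed layout, not copied from a real entry.
example_signal = 'SIGNAL 1..17; /evidence="ECO:0000255"'
example_transmem = 'TRANSMEM 700..723; /note="Helical"; TRANSMEM 730..750; /note="Helical"'
summary = summarize_features(example_signal, example_transmem)
print(summary)
```

The URL can be fetched with `urllib.request.urlopen` or any HTTP client. Keep in mind that many of these annotations are themselves predictions, often produced by the same kinds of tools, so this gives you the features without installing the programs but not independently of them.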

0

Thank you for the detailed description and the example. I already checked the possibilities in UniProt, and I do believe that the existing programs give the best results. I am just trying to see whether and how it is possible to do something similar without them. I agree that re-creating all those classifiers takes a lot of time. My point is to use basic features in the simplest way and then use more complicated machine learning to combine all those results and extract the information I need! It would be good, for example, to be able to calculate features like Phobius_TM, TargetP_SP, TMHMM_AA, and SignalP_D manually. I just don't get how to do it outside of the programs. Those features look like the output of each program's machine learning algorithm rather than the result of some calculation based on the sequences, the positions, etc.