I used standalone InterProScan 5.14-53.0-64 for functional annotation of the proteins predicted in my assembled genome. It works wonderfully and is quick. However, since I am working with whole-genome sequences, the output is very large.
I want to sort the protein sequences according to their function, i.e. by domain, and count the number of sequences in each class/domain type so that I can present my results graphically. Filtering the sequences by domain is possible in Excel, but filtering one domain at a time and counting thousands of sequences manually is exhausting and will take weeks.
I am not a computer science person, so can anyone help me find a way to automate this work? Are there any command lines I can use? Any help will be appreciated.
written 8.8 years ago by mirza, updated 4.8 years ago by Ram
I think that to annotate your genome it is much better to use dedicated programs (e.g. MAKER, PROKKA, RAST). Most of them use integrated approaches (e.g. TIGRFAM, PFAM, BLAST) to annotate the proteins encoded in the query genome. You can search this forum for more detailed posts on genome annotation.
I would not recommend the approach you used, because it is tedious and because you have to consider many things when recovering the information, e.g. proteins with more than one domain, proteins with domains and repeats, proteins with only repeats, proteins assigned to a family, and so on.
Hope this helps
written 8.8 years ago by dago, updated 4.8 years ago by Ram
Thanks, dago, for your suggestions.
I actually use Blast2GO for my annotations. But since our server is down and it will take time to come back, I didn't want to waste my time waiting.
Standalone InterProScan 5 is also an integrated tool: it considers all the existing protein databases, from Pfam, PROSITE and SMART up to SignalP, Gene3D, etc. It's really good.
written 8.8 years ago by mirza, updated 4.8 years ago by Ram
Yes, you are right. However, it is not really clear what your aim is. If you want to annotate the proteins encoded in your genome, I would suggest an annotation program. If you are interested in the recurrence of a specific domain, repeat or protein family, you could use InterProScan. I guess, as @Michael Dondrup said, you would need some scripting to extract the information from the TSV file.
Yes dago, I understand that I need a script or some command lines. That's the problem actually, because I am not a CS person; I am a molecular biologist. So writing scripts is not easy for me, and that's why I need help.
If you want domain counts, then you can certainly use InterProScan's output. The tabular output would be the easiest to parse. I think this is more or less the state of the art: Ensembl ran the InterProScan pipeline for our genome, and they have done this for most Ensembl genomes I have seen. If you go for domains, you will need to run Pfam, and you can also run all the other tools like TMHMM, PROSITE, PANTHER, etc. in one go. However, I would focus on one tool, like Pfam domains, because it gets more complicated when you try to compare the predictions of different tools. It will mostly need minimal scripting to parse the TSV file and extract the Pfam annotations. Pfam terms are not hierarchical, so you can use simple search/grep functionality (unlike with GO).
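To give an idea of how little scripting the parsing step needs, here is a minimal Python sketch. It assumes the standard InterProScan 5 TSV layout (column 1 = protein ID, column 4 = analysis name such as `Pfam`, column 5 = signature accession, column 6 = signature description). The function name `pfam_hits` and the sample rows are made up for illustration; they are not from the original run.

```python
# Column positions (0-based) in the InterProScan 5 TSV output.
PROTEIN, ANALYSIS, ACCESSION, DESCRIPTION = 0, 3, 4, 5


def pfam_hits(lines, analysis="Pfam"):
    """Yield (protein_id, accession, description) for each row of one analysis."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines and rows from other member databases.
        if len(cols) > DESCRIPTION and cols[ANALYSIS] == analysis:
            yield cols[PROTEIN], cols[ACCESSION], cols[DESCRIPTION]


# Invented example rows in the TSV layout (IDs, scores and dates are placeholders).
SAMPLE = [
    "\t".join(["g1.t1", "md5_1", "350", "Pfam", "PF00069",
               "Protein kinase domain", "10", "260", "1.2E-50", "T", "01-01-2016"]),
    "\t".join(["g1.t1", "md5_1", "350", "SMART", "SM00220",
               "Serine/Threonine protein kinases", "10", "262", "3.4E-60", "T", "01-01-2016"]),
    "\t".join(["g2.t1", "md5_2", "149", "Pfam", "PF00036",
               "EF hand", "12", "40", "5.6E-08", "T", "01-01-2016"]),
]

for protein, accession, description in pfam_hits(SAMPLE):
    print(protein, accession, description, sep="\t")
```

For a real result file you would pass the open file instead of `SAMPLE`, e.g. `with open("results.tsv") as handle: hits = list(pfam_hits(handle))`, where `results.tsv` is a placeholder name.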
Are you looking for a single domain or a few domains, or do you want to tabulate all Pfam predictions?
Are you trying to count the total occurrence of domains (counting repeated domains twice), or do you want to count the occurrence per protein (at least 1 domain)?
So, if you could specify a bit more exactly what you are after, that would help a lot.
[I hope you ran the pipeline on the predicted protein sequences, not on the full genomic DNA (I guess that would not even work). Is this correct?]
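The difference between the two counting modes asked about above (total occurrences versus occurrence per protein) can be sketched in a few lines of Python. This assumes the hits have already been reduced to (protein ID, domain accession) pairs, one pair per predicted domain copy. The function name `count_domains` and the example pairs are invented for illustration.

```python
from collections import Counter


def count_domains(hits):
    """Return (total, per_protein) domain counts from (protein_id, domain) pairs.

    total       counts every predicted copy, so a repeated domain counts twice.
    per_protein counts each protein at most once per domain.
    """
    hits = list(hits)
    total = Counter(domain for _, domain in hits)
    # set() collapses repeated (protein, domain) pairs to a single entry.
    per_protein = Counter(domain for _, domain in set(hits))
    return total, per_protein


# Invented example: g2.t1 carries two copies of the same domain.
HITS = [
    ("g1.t1", "PF00069"),
    ("g2.t1", "PF00036"),
    ("g2.t1", "PF00036"),
    ("g3.t1", "PF00069"),
]

total, per_protein = count_domains(HITS)
print("total occurrences:", dict(total))
print("proteins with domain:", dict(per_protein))
```

Which of the two numbers to report depends on the question: the per-protein count answers "how many proteins belong to this class", while the total also reflects tandem repeats within a protein.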
Yes, I used the protein sequences predicted by AUGUSTUS. I used multiple databases (Pfam, PROSITE, SMART, pathways), all in one command, because I was interested in functional annotation of all contigs, and if there are some rare or unique protein sequences, at least one of the databases should predict the domain.
I have my results in TSV format. I am now interested in counting all translated sequences domain-wise, and I want to count the occurrence per protein/translated sequence to get a final table like "protein families predicted in the assembled genome":
calcium kinase - 14;
calmodulin protein - 20; ... and so on.
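A table of this kind could be produced with a short Python sketch like the one below. Because several member databases were run together, the same family may be reported by more than one of them, so this sketch groups on the InterPro entry instead (columns 12 and 13 of the TSV). That is an assumption about the run: those columns are only filled in when InterPro lookup was enabled (the `-iprlookup` option, as far as I know), and rows without an entry are skipped here. The names `family_table` and `make_row` and all sample rows are invented for illustration.

```python
from collections import defaultdict

# 0-based positions of the InterPro accession and description (TSV columns 12 and 13).
IPR_ACC, IPR_DESC = 11, 12


def family_table(lines):
    """Return [(interpro_acc, description, n_proteins)], most frequent first."""
    proteins = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip rows with no InterPro entry (columns missing, empty, or "-").
        if len(cols) <= IPR_DESC or cols[IPR_ACC] in ("", "-"):
            continue
        # A set ensures each protein is counted once per entry,
        # even if several databases or repeats hit the same entry.
        proteins[(cols[IPR_ACC], cols[IPR_DESC])].add(cols[0])
    rows = [(acc, desc, len(ids)) for (acc, desc), ids in proteins.items()]
    return sorted(rows, key=lambda row: (-row[2], row[1]))


def make_row(protein, analysis, accession, description, ipr=None):
    """Build one invented TSV row; scores and coordinates are placeholders."""
    cols = [protein, "md5", "300", analysis, accession, description,
            "1", "100", "1.0E-10", "T", "01-01-2016"]
    if ipr:
        cols.extend(ipr)
    return "\t".join(cols)


SAMPLE = [
    make_row("g1.t1", "Pfam", "PF00069", "Protein kinase domain",
             ("IPR000719", "Protein kinase domain")),
    make_row("g1.t1", "SMART", "SM00220", "Serine/Threonine protein kinases",
             ("IPR000719", "Protein kinase domain")),
    make_row("g2.t1", "Pfam", "PF00036", "EF hand",
             ("IPR002048", "EF-hand domain")),
    make_row("g3.t1", "Pfam", "PF00069", "Protein kinase domain",
             ("IPR000719", "Protein kinase domain")),
    make_row("g4.t1", "Pfam", "PF00001", "Example domain without InterPro entry"),
]

for accession, description, n_proteins in family_table(SAMPLE):
    print(f"{description} - {n_proteins}")
```

The printed lines can be pasted into Excel for plotting, which replaces only the manual filtering and counting and leaves the charting step as it was.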
Hello mirza, have you sorted out your problem? I want to do the same: count the different families or domains in the InterProScan output file. Thanks
I had to calculate it manually in Excel using filters and some functions.