Get protein domain information from nucleotide sequence of assembled contigs?
2.2 years ago
DanielC ▴ 140

Dear Friends,

I have a set of predicted ORFs/CDS and want to check whether they contain known protein domains. Can this be done with InterProScan, or is there another accurate method? Thank you for your time.

contig domain proteins interprp hmmpred • 585 views
2.2 years ago

InterProScan is indeed the preferred approach and likely the most comprehensive one. Alternatively, you could go for a 'subset analysis', for instance by only looking at Pfam domains, but the speed gain will be minimal.

Important to know: all of these tools work best (only?) at the protein level, so you will have to translate your CDS sequences to protein sequences first.


Thanks for the confirmation! Can you please let me know the best approach to translate these CDS? Since these are CDS/ORFs, I want to make sure that the start and stop codons are treated properly during translation. Thanks!


If you already have (accurate) CDS predictions you can simply translate them (EMBOSS transeq for instance, or any custom implementation will do as well).
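To illustrate what a 'custom implementation' could look like, here is a minimal Python sketch that translates a CDS in frame +1 with the standard genetic code and stops at the first stop codon. The function name translate_cds and the example sequence are made up for illustration; a real pipeline would more likely use transeq or an existing library.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(cds):
    """Translate a CDS in frame +1, stopping at the first stop codon.

    Codons containing ambiguous bases (e.g. N) are translated as 'X'.
    Trailing bases that do not form a full codon are ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical toy CDS: ATG GCC AAA TAA
    print(translate_cds("ATGGCCAAATAA"))
```

Because the stop codon is dropped here, the output has no trailing "*", which is roughly what -trim does in transeq.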

If you only have 'transcripts', I would recommend TransDecoder, FrameD, ... These are generally called ORF finders and will give you ORFs/CDSs with the corresponding proteins.

Alternatively, you can even submit nucleotide sequences to InterProScan, which will then do an internal translation of the CDS to protein (using getorf from the EMBOSS package). However, this will not give you any control over which ORF is used when there are multiple options, so it would not be my preferred approach; I would personally like to have more control over the ORF prediction.

As an alternative to the alternative (and depending on the species you work on), something like TRAPID might be an option as well. It does lots of analyses at once but might be a bit overkill.


Thanks! These CDS are predicted by Glimmer.

So, if I understand correctly, the best approach is to translate the CDS with EMBOSS (outside of InterProScan) and then run InterProScan? And, since I need to do this for many CDS, can this be done locally? Please let me know. Thanks!


Exactly!

The EMBOSS package was just a suggestion; there are several other options. The key thing is to pick one that simply translates your CDSs and does not do CDS/ORF finding.

I'm not an expert Glimmer user, but does it not provide proteins as output as well (perhaps by activating a parameter)?


I don't think Glimmer outputs CDS nucleotide and protein FASTA files. I am looking for a standalone way to translate the CDSs to proteins but, to be honest, I have not found a suitable one. Could you please recommend one? I am trying to install transeq from EMBOSS, but transeq gives 6 frames and I just want a translation of my CDSs. Moreover, I have Prokka annotation results, so I know what the translated protein sequences look like, which helps me check the output of another translation program. Thanks!


OK, I managed to translate the CDS sequences to proteins using transeq. However, I am facing one issue: not all start codons are translated to methionine, and I want them to be. I used the command:

transeq -sequence xx.fasta -outseq yy.fasta -trim -methionine

but in the output protein FASTA file, I see some protein sequences starting with L or other amino acids instead of methionine. Any idea how I can make transeq use only methionine for the start codon? Thanks!


Are you sure that all the sequences you provided to transeq start with an ATG?

Moreover, I think you will have to provide the methionine parameter as -methionine y. Also, why did you add the -trim option? If you have CDSs, they should all end in a stop codon (or at the end of the sequence), no?
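A quick way to answer the "do they all start with ATG?" question is to tally the first codon of every record in the CDS FASTA file. The sketch below is only an illustration: the helper names read_fasta and start_codon_counts are made up, and the input is assumed to be a plain nucleotide FASTA such as the xx.fasta used in the command above.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def start_codon_counts(lines):
    """Count how often each codon occurs as the first codon of a record."""
    return Counter(seq[:3].upper() for _, seq in read_fasta(lines))


if __name__ == "__main__":
    # Hypothetical toy records; in practice pass an open file handle instead,
    # e.g. start_codon_counts(open("xx.fasta")).
    toy = [">a", "ATGAAATAA", ">b", "TTGAAA", "TAA", ">c", "atgccctga"]
    for codon, n in start_codon_counts(toy).most_common():
        print(codon, n)
```

In bacterial gene predictions, a minority of GTG or TTG starts is expected, so a count like this mostly tells you which alternative start codons the translation step has to handle.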


You can limit transeq to only decode the +1 frame of your CDS (or any other frame). Have a look at the manual for transeq.

If you have prokka results, then those should come with proteins, no?


Thanks for the response! Most of the CDSs start with ATG, about 90 to 95%. Moreover, in the Prokka results all CDSs start with methionine. To translate, I just used the command above and did not have to give any additional parameter for frames. I tried "-methionine y", but still not all start codons are converted to methionine. I used "-trim" to remove the "*" characters from the end of the translated proteins.

One thing to mention here is: Yes, I have prokka protein translations of CDSs. I am using prokka and glimmer to produce CDSs and compare them for any missing ones. Thought this is a good approach rather than relying only on one ORF prediction tool.
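One possible way to do such a comparison is to match predictions by strand and stop position, since two gene finders often agree on the stop codon but pick different starts. The sketch below assumes each prediction has already been reduced to a hypothetical (start, end, strand) tuple with start < end on a single contig; the function names and the toy coordinates are made up for illustration.

```python
def stop_key(gene):
    """Return (strand, stop coordinate) for a (start, end, strand) tuple.

    On the '+' strand the stop codon sits at the higher coordinate,
    on the '-' strand at the lower one.
    """
    start, end, strand = gene
    return (strand, end if strand == "+" else start)


def compare_predictions(set_a, set_b):
    """Group two prediction sets by whether they share a stop position."""
    a = {stop_key(g): g for g in set_a}
    b = {stop_key(g): g for g in set_b}
    shared = a.keys() & b.keys()
    return {
        "identical": sorted(k for k in shared if a[k] == b[k]),
        "same_stop_different_start": sorted(k for k in shared if a[k] != b[k]),
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
    }


if __name__ == "__main__":
    # Hypothetical toy coordinates, not real Prokka or Glimmer output.
    prokka = [(100, 400, "+"), (500, 800, "-"), (900, 1200, "+")]
    glimmer = [(130, 400, "+"), (500, 800, "-"), (1500, 1800, "+")]
    for group, keys in compare_predictions(prokka, glimmer).items():
        print(group, keys)
```

Predictions that land in "same_stop_different_start" are the ones where the two tools disagree on the start codon, which are good candidates for manual curation.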

Also, I see that the proteins translated by Prokka all have methionine at the start codon, unlike my translations of the Glimmer CDSs. I don't think it matters much, but if I leave the first residue of the translated Glimmer proteins as it is, does it really matter? All I want to do is get annotations from Prokka and Glimmer and view them in a viewer or annotator for manual curation, if required. Thanks!
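For context, bacteria can use alternative start codons such as GTG and TTG, which encode valine and leucine internally but are read as methionine when they initiate translation. A plain codon-by-codon translation therefore yields V or L at position 1. The sketch below shows one way to force methionine at the first codon, using the start codons listed for NCBI translation table 11; the function name translate_bacterial_cds is hypothetical and this is not a description of how Prokka or transeq do it internally. transeq also has a -table option (11 being the bacterial code) that may be worth checking in its manual.

```python
from itertools import product

# Amino acid assignments are identical in NCBI tables 1 and 11;
# the tables differ only in which codons may act as start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Start codons listed for NCBI translation table 11.
BACTERIAL_STARTS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}


def translate_bacterial_cds(cds, force_met=True):
    """Translate a CDS in frame +1 and stop at the first stop codon.

    With force_met=True, a recognised start codon at position 1 is
    written as 'M' regardless of its usual amino acid.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE.get(codon, "X")
        if i == 0 and force_met and codon in BACTERIAL_STARTS:
            aa = "M"
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    toy = "TTGGCCAAATAA"  # hypothetical CDS with a TTG start
    print(translate_bacterial_cds(toy, force_met=False))
    print(translate_bacterial_cds(toy, force_met=True))
```

Only the first residue changes between the two calls, so for domain searches the difference is usually negligible.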


Usually this will not matter much since, biologically, domains are rarely found in the N- or C-terminal regions of proteins (there are exceptions though). One thing you really do need to pay attention to is the reading frame: as long as you are using the correct reading frame, you'll likely be OK.

I bluntly assumed you had transcripts on which you defined ORFs, but I am starting to think you are talking about a gene prediction result on genomic data. Is that so? If Glimmer outputs a GFF (or similar) file, you can also convert that to CDSs/proteins. Have a look here on the forum or on the internet to see how that is most easily done.
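To make the reading-frame point concrete, here is a small sketch that translates the same toy sequence in the three forward frames. The function name three_frame_translation and the sequence are invented for illustration.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def three_frame_translation(seq):
    """Translate seq in forward frames 1, 2 and 3, keeping stops as '*'."""
    seq = seq.upper()
    frames = {}
    for offset in range(3):
        sub = seq[offset:]
        frames[offset + 1] = "".join(
            CODON_TABLE.get(sub[i:i + 3], "X") for i in range(0, len(sub) - 2, 3)
        )
    return frames


if __name__ == "__main__":
    # Hypothetical toy CDS: only frame 1 gives the intended protein.
    for frame, protein in three_frame_translation("ATGGCCAAATAA").items():
        print(frame, protein)
```

Shifting by a single base gives a completely unrelated peptide, which is why a wrong frame would wipe out any domain hits while a different first residue would not.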
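As a rough sketch of such a conversion, the code below cuts CDS sequences out of a genome using GFF-style coordinates (1-based, inclusive) and reverse-complements features on the minus strand. It assumes tab-separated GFF lines with the feature type "CDS" in column 3 and single-interval genes, as is typical for prokaryotes; whether your Glimmer output is available in that form is something to check. The function names and the toy data are made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cds_from_gff(genome, gff_lines):
    """Yield (attributes, cds_sequence) for every CDS line in a GFF file.

    genome is a dict mapping sequence IDs to their full sequences.
    GFF coordinates are 1-based and inclusive, hence start - 1 below.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        sub = genome[seqid][start - 1:end]
        if strand == "-":
            sub = reverse_complement(sub)
        yield cols[8], sub


if __name__ == "__main__":
    # Hypothetical toy contig with one gene on each strand.
    genome = {"contig1": "CC" + "ATGGCCAAATAA" + "GG" + "TTATTTGGCCAT" + "CC"}
    gff = [
        "##gff-version 3",
        "contig1\tglimmer\tCDS\t3\t14\t.\t+\t0\tID=orf1",
        "contig1\tglimmer\tCDS\t17\t28\t.\t-\t0\tID=orf2",
    ]
    for attrs, cds in cds_from_gff(genome, gff):
        print(attrs, cds)
```

The extracted CDSs are already in frame +1, so they can be translated directly without any ORF finding.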