Question

How to get predicted protein sequence from a genome?

8.2 years ago

qingxiangg ▴ 40

Hi everyone,

I'm a beginner and wonder how to get predicted protein sequences from a genome.

For example,

Here is a GenBank assembly of a (quite small) genome, with no protein sequences available for download (e.g. XXX.pep.faa). I then tried to predict the protein sequences myself using software like EVM. But when I downloaded the GFF3 file for this genome, it did not look like a standard GFF3 file. Instead of CDS or exon features, I see a lot of 'Genbank: URL....'.

Others have suggested predicting the ORFs first and then translating them with the standard codon table.

Do you have any idea about this? Thanks for your time!

genome next-gen • 3.9k views

updated 5.6 years ago by Ram 43k • written 8.2 years ago by qingxiangg ▴ 40

I've been looking at those GFF files and it seems they do not have information about proteins. For bacteria, which have no introns, predicting ORFs is a nice starting point. For this organism (a myxozoan), I would first determine whether its genes contain introns. A blastx of these contigs against known proteins will help you answer this, and will also help you identify proteins.
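To make the ORF idea concrete, here is a minimal pure-Python sketch of six-frame ORF prediction for an intron-free sequence. It is only an illustration, not a substitute for a real gene finder. The function names (`find_orfs`, `translate`, `reverse_complement`) and the `min_aa` length cutoff are hypothetical choices made for this example, and it assumes the standard genetic code (NCBI translation table 1) with ATG as the only start codon.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def find_orfs(dna, min_aa=30):
    """Yield (strand, frame, protein) for ORFs in all six reading frames.

    An ORF here runs from the first 'M' of a stop-free stretch to its end.
    The last stretch of a frame may lack a stop codon (partial ORF at the
    end of a contig). min_aa is an arbitrary cutoff assumed for this sketch.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    yield strand, frame, segment[start:]


# Toy example with a very low cutoff so the short ORF is reported.
print(list(find_orfs("CCATGAAATTTGGGTAACC", min_aa=3)))
# → [('+', 2, 'MKFG')]
```

The frame is reported as a 0-based offset into the strand being read. On a genome with introns this approach will split or miss genes, which is why checking for introns first matters.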

updated 5.6 years ago by Ram 43k • written 8.2 years ago by abascalfederico ★ 1.2k

Thanks aba! I think I've got your idea!

To evaluate whether introns exist in this myxozoan genome (a eukaryote), I selected some sequences from it and ran blastx.

I did get some hits, like:

Range 1: 320 to 411

Query  412  YENNVLNIRQFKHSPHPYWLPNFMNVFTWSIPFVGEKSYIYPYIFL*NINHVNS*LVMEI 233
            YENNV+NIRQF  SPHPYWLPNFM+VFTWS+PFVGEK                   V E+
Sbjct  320  YENNVMNIRQFNCSPHPYWLPNFMDVFTWSLPFVGEK-------------------VTEM  360

Query  232  LDAILRIASE-----DTDDSEVITQELTRKDIVKNKIRAVGRMSRLFGILR  95
            L +IL I S+     D DD+        RK+I++NKIRA+G+M+R+F +LR
Sbjct  361  LVSILNICSDDELLSDGDDTFEGGSAAARKEIIRNKIRAIGKMARVFTVLR  411

Range 2: 283 to 313

Query  566  SYRMYKKNSSTGFPSLITIFSAPNYLDVYNN  474
             YRMY+K+ +TGFPSLITIFSAPNYLDVYNN
Sbjct  283  GYRMYRKSPATGFPSLITIFSAPNYLDVYNN  313

.....

So I found that some of them contain several introns. (This makes sense because it is a eukaryote.)

So as the next step for genome-wide protein prediction, I will blastx the genome against known databases (e.g. nr, Swiss-Prot).

This way I'll get some BLAST-supported proteins, but what should I do with the remaining unmatched sequence?

For transcriptomes, software like ESTScan or TransDecoder can solve this. What about genomes?

updated 5.6 years ago by Ram 43k • written 8.2 years ago by qingxiangg ▴ 40

I would try some gene prediction software. Look for one that uses protein evidence (blastx or similar). I have no experience with this, so I cannot recommend one in particular, but that's what should be done in this case.

I've seen in NCBI's taxonomy that there are already 15,204 proteins characterised for myxozoans (look for myxozoa and click on "Protein"). These sequences would be the most valuable, but they may not cover all your protein-coding genes.

BTW, you focused on "Subject" coordinates, but I think you wanted to look at "Query" coordinates (your query DNA fragments). Yes, Myxozoans have introns.
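To illustrate what looking at the Query coordinates means, here is a small Python sketch that compares consecutive blastx HSPs from one query/subject pair. The `HSP` tuple and the `implied_gaps` function are hypothetical names for this example. It assumes all HSPs lie on the same strand and in the same order along the DNA as along the protein. The idea: if the stretch of DNA between two HSPs is much longer than three times the number of protein residues skipped, the extra DNA is a candidate intron (a short excess can also come from indels or poorly conserved regions).

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int  # query (DNA) coordinate where the alignment starts
    q_end: int    # query (DNA) coordinate where it ends (< q_start on the minus strand)
    s_start: int  # subject (protein) start
    s_end: int    # subject (protein) end


def implied_gaps(hsps):
    """Return (subject_gap_aa, query_gap_nt, excess_nt) for consecutive HSPs.

    HSPs are ordered along the protein. excess_nt is the DNA between two
    HSPs that is not accounted for by the protein residues between them.
    Assumes same-strand, collinear HSPs (an assumption of this sketch).
    """
    ordered = sorted(hsps, key=lambda h: h.s_start)
    gaps = []
    for a, b in zip(ordered, ordered[1:]):
        subject_gap = b.s_start - a.s_end - 1        # residues skipped in the protein
        query_gap = abs(b.q_start - a.q_end) - 1     # bases skipped in the DNA
        gaps.append((subject_gap, query_gap, query_gap - 3 * subject_gap))
    return gaps


# Coordinates taken from the two ranges posted above (minus-strand hits).
hsps = [HSP(412, 95, 320, 411), HSP(566, 474, 283, 313)]
print(implied_gaps(hsps))
# → [(6, 61, 43)]
```

Here 6 residues are skipped in the protein but 61 bases in the DNA, leaving 43 bases unexplained between the two ranges. Whether that is an intron or just a divergent region needs a closer look, for example with a splice-aware protein-to-genome aligner.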

updated 5.6 years ago by Ram 43k • written 8.2 years ago by abascalfederico ★ 1.2k