Question

Solution to extract protein coding genes from nucl seqs

6.7 years ago

T_18 ▴ 50

Dear all,

This question might be seen as an extension of earlier questions (e.g. https://www.biostars.org/p/5801/).

I’m currently having problems screening sequence data (both fully assembled genomes and “short” sequences from, e.g., the NCBI Nucleotide database) for the presence of gene family members (in this case P450). The results I get are very disappointing and I am wondering what I might be doing wrong, which is why I am asking for your expertise.

The steps I take now are roughly:

1. Download sequence data from NCBI for specific clades.
2. Extract potential protein-coding genes from the raw nucleotide sequences using getORF. This of course results in a lot of rubbish, and to minimize this I set “minimal size” to 450 bp.
3. Run pHMM searches and subsequent blastP searches, based on a selection of P450 protein sequences from the Pfam database, to find potential P450 gene family members in the downloaded sequence data.

As an ideal test, I applied this method to the raw nucleotide GenBank genome assembly of Helicoverpa armigera (NCBI ID = GCA_002156985.1), for which all encoded proteins are known (e.g. for P450, https://www.ncbi.nlm.nih.gov/genome/proteins/13316?genome_assembly_id=319039). I downloaded all the genome scaffolds and used getORF to extract all ORFs (in theory this includes all the putative protein-coding genes). I then ran HMM and blastP searches to extract the potential P450 protein-coding genes from this ORF dataset. This resulted in roughly 50 HMM hits. However, 116 P450 proteins are known from this genome! So my method does not seem to find the correct number of potential P450 genes.

Just to make sure that my HMM and blastP method was OK, I checked whether it would “hit” the P450 proteins in the already annotated protein dataset, and it did: HMM and blastP nicely found all P450 protein sequences in that dataset. But then why do I only find 50 potential P450 genes in the ORF dataset? Is getORF not the right method to extract potential protein-coding genes from raw nucleotide sequences (along with all the non-coding sequences that are extracted with them)?

Thanks in advance for all your help here!

genome gene blast protein • 1.7k views

score 0 · Answer 1 · 2017-08-12

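Your approach does not work well because of splicing. In eukaryotes the coding sequence of most genes is split into exons separated by introns, so an ORF read straight from the genome typically breaks off at the first intron. The ORFs you get from running getORF on eukaryotic genomic DNA are therefore mostly meaningless; to be more precise, only single-exon genes will be correctly identified, amongst a lot of garbage.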

“(in theory this includes all the putative protein-coding genes)”

No, it does not, due to splicing.
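To make the splicing point concrete, here is a minimal toy sketch in Python. It is not what getORF does internally; it is only a simple forward-strand ATG-to-stop scan, and the two-exon "gene" is an invented sequence used purely for illustration. It shows that the ORF read from the genomic sequence stops inside the intron, while the same scan on the spliced mRNA recovers the whole protein.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf_protein(dna):
    """Translate the longest ATG-to-stop ORF on the forward strand (3 frames)."""
    best = ""
    for frame in range(3):
        protein = None  # None means we are not inside an ORF
        for i in range(frame, len(dna) - 2, 3):
            aa = CODON_TABLE[dna[i:i + 3]]
            if protein is None:
                if aa == "M":
                    protein = "M"  # open a new ORF at the start codon
            elif aa == "*":
                if len(protein) > len(best):
                    best = protein  # keep the longest closed ORF
                protein = None
            else:
                protein += aa
    return best


# Invented toy gene: two exons separated by a GT...AG intron.
EXON1 = "ATGGCTGAAACC"
INTRON = "GTAAGTTAATTTTCAG"
EXON2 = "CTGAAAGGTTGGTAA"

GENOMIC = EXON1 + INTRON + EXON2  # what an ORF finder sees in the genome
MRNA = EXON1 + EXON2              # what the cell actually translates

print("ORF from genomic DNA:", longest_orf_protein(GENOMIC))
print("ORF from spliced mRNA:", longest_orf_protein(MRNA))
```

On the genomic sequence the scan reads through the exon boundary into the intron and hits a stop codon there, so the second exon never ends up in the ORF. With real genes that have many exons, each ORF covers at most a fragment of the protein, and many of those fragments will fall below a 450 bp minimum size.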

You can run getORF directly only on bacterial DNA or mRNA sequences, and even there with caution. As an alternative, you may want to run the tools on the predicted protein sequences directly, or, if those are not available, run a more or less advanced gene prediction yourself. If you don't want to invest in more advanced gene prediction, you can use a "poor man's gene prediction" with tblastn; this approach will be much more sensitive than both HMMER and blastP on getORF output.
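One possible way to turn tblastn results into rough gene candidates is sketched below. It assumes tblastn was run with known P450 proteins as queries against the genome and with tabular output (`-outfmt 6`, default 12 columns). The function name `candidate_loci`, the `max_gap` and `max_evalue` thresholds, and the example rows are all hypothetical choices for illustration, not part of BLAST. The idea is that separate exons of one gene show up as separate hits close together on the same scaffold and strand, so nearby hits are merged into one locus.

```python
from collections import defaultdict


def candidate_loci(tabular_lines, max_gap=10000, max_evalue=1e-5):
    """Merge nearby tblastn hits (outfmt 6) into candidate gene loci.

    Returns a list of (scaffold, strand, start, end) tuples.
    max_gap and max_evalue are arbitrary illustrative defaults.
    """
    hits = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue  # drop weak hits
        # In tblastn output, sstart > send means a minus-strand hit.
        strand = "+" if sstart <= send else "-"
        hits[(sseqid, strand)].append((min(sstart, send), max(sstart, send)))

    loci = []
    for (sseqid, strand), intervals in sorted(hits.items()):
        intervals.sort()
        cur_lo, cur_hi = intervals[0]
        for lo, hi in intervals[1:]:
            if lo - cur_hi <= max_gap:
                cur_hi = max(cur_hi, hi)  # same locus, probably another exon
            else:
                loci.append((sseqid, strand, cur_lo, cur_hi))
                cur_lo, cur_hi = lo, hi
        loci.append((sseqid, strand, cur_lo, cur_hi))
    return loci


# Made-up example rows in outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
EXAMPLE_ROWS = [
    "P450_q1\tscaffold_1\t62.5\t80\t30\t0\t1\t80\t1000\t1239\t1e-30\t110",
    "P450_q1\tscaffold_1\t58.0\t90\t38\t0\t81\t170\t2500\t2769\t1e-28\t102",
    "P450_q2\tscaffold_1\t55.0\t100\t45\t0\t10\t109\t90300\t90001\t1e-25\t95",
    "P450_q1\tscaffold_2\t30.0\t40\t28\t0\t5\t44\t500\t619\t0.5\t25",
]

for locus in candidate_loci(EXAMPLE_ROWS):
    print(locus)
```

Counting loci rather than individual hits gives a number that is more comparable to the number of annotated genes. Tandem duplicates, which are common in P450 families, can get merged into one locus if `max_gap` is too generous, so the threshold would need tuning for the genome at hand.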

score 0 · Answer 2 · 2017-08-15

Hi Michael,

Thanks for your quick reply and for pointing this out. Of course! I did not think of splicing. Good to know why I didn't get the results I expected. I will first apply this same approach to the predicted protein sequences, ESTs and transcriptome assemblies. Secondly, I need to decide whether I want to invest a lot of time in a gene prediction approach for the sequences that have no predictions.

Thanks again!