Question: Finding Exon-Intron Boundaries With ORFs: How To Choose The ORFs That Correspond To True Exons?
8.5 years ago, Francois Olivier Hébert wrote:


I have a set of 2,228 whitefish (salmonid) genes assembled from an exon capture chip. Each sequence corresponds to a complete or partial gene. Since this is genomic DNA, I would like to find the exons within these sequences. So far, I have used a "blastx" approach against D. rerio coding sequences, complemented with the "nr" database for sequences with no hits. My Python script identifies many putative exons, but I have the feeling that it also misses a lot of them.

I know I can use an ORF-based approach, for example with EMBOSS (getorf or sixpack), but the problem is: how do I choose the ORFs that correspond to true exons among the many results I get when identifying ALL the ORFs in my sequences? If I use a length filter (e.g. ORFs > 300 bp), will I miss some small or partial exons?

I don't know whether all of my gene sequences are of VERY good quality. Several of them are, but since they come from a de novo assembly of genomic DNA and the whitefish genome contains a lot of repeated sequences, some might be of poor quality.

So, in brief, my major concern is how to know that the ORFs identified as exons do not lie partially or entirely in introns, and how I can implement a filtering method to keep only the good ones.

Thank you very much for any help or suggestions!

8.5 years ago, Larry_Parnell (Boston, MA USA) wrote:

Try sequence similarity to a protein database, say from salmon or another completed fish genome (zebrafish, pufferfish). This will help you identify the reading frame of each potential exon. Exons begin at the splice acceptor site (typically AG as the last two positions of the preceding intron) and end at the splice donor site (typically GT as the first two positions of the next intron). If you use genomic data as your query, you will see the similarity to other proteins fall off at or near these splice junctions. An ORF can be smaller than 300 bp, so a size filter is not recommended.
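
As a rough illustration of the acceptor/donor rule above, here is a minimal Python sketch that checks whether a candidate exon is flanked by the canonical dinucleotides. The function name and the 0-based, end-exclusive coordinates are assumptions made for this example. It only looks at the forward strand (reverse-complement the contig first for minus-strand hits), and the first and last exons of a gene will naturally lack an acceptor or a donor, respectively.

```python
def has_canonical_splice_sites(seq, start, end):
    """Check the bases flanking a candidate exon seq[start:end].

    start/end are 0-based, end-exclusive, forward-strand coordinates
    (an assumption of this sketch). Returns (acceptor_ok, donor_ok);
    a value is None when the contig ends before the two flanking
    bases are available, so that side cannot be judged.
    """
    seq = seq.upper()
    # Acceptor: last two intron bases immediately upstream of the exon.
    acceptor_ok = (seq[start - 2:start] == "AG") if start >= 2 else None
    # Donor: first two intron bases immediately downstream of the exon.
    donor_ok = (seq[end:end + 2] == "GT") if end + 2 <= len(seq) else None
    return acceptor_ok, donor_ok


# Toy contig: intron tail "ccttag", exon "ATGGCCAAG", intron head "gtaagt".
contig = "ccttagATGGCCAAGgtaagt"
print(has_canonical_splice_sites(contig, 6, 15))
```

In practice the blastx alignment ends rarely fall exactly on the junction, so one would probably scan a few bases on either side of each alignment boundary for AG/GT rather than test a single position.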

You can also keep track of the coordinates on the matching protein. If the match for one exon ends at protein position 212 and the next begins at position 243, there is likely a missing exon encoding the 30 intervening amino acids, i.e. about 90 bp (or 89 or 91 bp).
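
The coordinate bookkeeping described above can be sketched in a few lines of Python. The function name and the input layout (a list of 1-based, inclusive protein start/end pairs for hits to the same protein) are assumptions for this example, not the output format of any particular tool.

```python
def protein_gaps(hits, min_gap=1):
    """Find stretches of the subject protein not covered by any hit.

    hits: list of (protein_start, protein_end), 1-based inclusive,
    all against the same protein (hypothetical input layout).
    Returns (gap_start, gap_end, missing_aa, approx_bp) tuples.
    """
    gaps = []
    ordered = sorted(hits)
    if not ordered:
        return gaps
    covered_to = ordered[0][1]  # last protein position covered so far
    for start, end in ordered[1:]:
        missing = start - covered_to - 1
        if missing >= min_gap:
            # approx_bp is 3 nt per residue; the real exon can be a
            # base or two off, as noted in the text.
            gaps.append((covered_to + 1, start - 1, missing, missing * 3))
        covered_to = max(covered_to, end)
    return gaps


# The example from the text: one match ends at 212, the next starts at 243.
print(protein_gaps([(150, 212), (243, 300)]))  # [(213, 242, 30, 90)]
```

A reported gap is only a hint that an exon may have been missed; it could equally reflect a divergent region or an incomplete contig.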

An ORF that is not really part of a protein will not match other fish genes/proteins well. Furthermore, since the exons of a given gene cover consecutive stretches of its protein, an ORF with a spurious match to a protein will likely have neighboring ORFs that do not match that same protein at the adjacent positions.
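
One way to turn this neighbor argument into a filter is a simple collinearity check, sketched below in Python. The function name, the tuple layout, and the `max_overlap` tolerance are assumptions for this example. It assumes all hits are on the forward strand of the contig and against the same protein.

```python
def is_collinear(hits, max_overlap=5):
    """Check that hits to one protein progress along it in genomic order.

    hits: list of (genomic_start, protein_start, protein_end) for hits
    on the forward strand against the same protein (hypothetical layout).
    max_overlap: residues by which neighboring hits may overlap on the
    protein, since alignments often run a little past a splice junction.
    """
    ordered = sorted(hits)  # sort by position on the contig
    for prev, nxt in zip(ordered, ordered[1:]):
        prev_protein_end = prev[2]
        next_protein_start = nxt[1]
        # The next exon must continue at or beyond where the previous one ended.
        if next_protein_start <= prev_protein_end - max_overlap:
            return False
    return True


consistent = [(100, 1, 50), (400, 51, 120), (900, 118, 200)]
shuffled = [(100, 51, 120), (400, 1, 50)]
print(is_collinear(consistent), is_collinear(shuffled))  # True False
```

A hit that breaks collinearity is a candidate spurious ORF, although a misassembled contig can produce the same pattern.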

Lastly, be careful of ORFs encoded by repeat elements. These may encode proteins (say protein elements of an LTR retrotransposon), and should be handled differently.


Thank you very much for the help! Now I see how I can build an analysis pipeline based on what you wrote. It is always more complicated to deal with coding regions in gDNA from a "non-model" species (i.e. no reference genome). Cheers!

Reply written 8.5 years ago by Francois Olivier Hébert
8.5 years ago, JC wrote:

Maybe you can extend your ORF predictions with a better system: try Augustus, trained with your sequences, to predict the gene models in your genomic context.


Thanks! I will definitely take a look at that program and maybe use it in my analysis pipeline. It seems to be quite a "complete" application. I will also share it with other lab members.

Reply written 8.5 years ago by Francois Olivier Hébert
8.5 years ago, Francois Olivier Hébert wrote:

To conclude on this topic, it can be extremely complicated to decide which ORFs to keep and which to discard when you have 48 results for the same contig and 30 of them have a BLAST hit against some protein in the "nr" database on GenBank. Depending on the quality of your sequences, it can be almost impossible to make the right decision (working with gDNA can be very complex!).

In this case, you can use a nice program that I found, called "ASP". You can either download the source code and install it on your machine, or use the Galaxy service by importing your data (FASTA file) on the website. You have all the information that you need here:

The only thing you need is your FASTA sequences (genomic DNA). The program identifies the donor and acceptor splice sites in your sequences. It uses a rigorous algorithm to score each splice site found (AG, GT, GC), and from this information you can retrieve the exon-intron boundaries. You can specify the "type of organism" you are working with (Fish, Human, Cress, Worm, Fly). Most gene prediction programs need a lot of information to work, and in my case I didn't have everything that was needed, so I used ASP and it worked fine. I can predict the exons in my genes with pretty good confidence (based on the score given in the output file).

Enjoy!


The link in the answer above does not work.

Reply written 4.7 years ago by Prakki Rama