Question

Finding Exon-Intron Boundaries With ORFs: How To Choose The ORFs That Correspond To True Exons?

11.9 years ago

Francois Olivier Hébert ▴ 280

Hi,

I have a set of 2,228 whitefish (salmonid) genes assembled from an exon capture chip. These sequences each correspond to a complete or a partial gene sequence. Since it's genomic DNA, I would like to find the exons within these sequences. So far, I have used a "blastx" approach using D. rerio's coding sequences, complementing with "nr" database for the sequences with no hits. My Python script can identify many putative exons, but I also have the feeling that it misses a lot of them.

I know I can use an ORF-based approach, for example with EMBOSS (getorf or sixpack), but the problem is: how do I choose the ORFs that correspond to true exons among the many results I get when identifying ALL the ORFs in my sequences? If I use a length filter (e.g. ORFs > 300 bp), will I miss some small or partial exons?

I don't know whether all of my gene sequences are of VERY good quality. Several of them are, but since they come from a de novo assembly of genomic DNA, and the whitefish genome contains a lot of repeated sequences, some might be crappy.

So, in brief, my major concern is how to know that the ORFs identified as exons are not partially or entirely in introns, and how I can implement a filtering method to keep only the good ones.

Thank you very much for any help or suggestion!

orf • 9.4k views

score 2 · Answer 1 · 2012-05-21

Try sequence similarity to a protein database, say from salmon or another completed fish genome (zebrafish, pufferfish). This will help you identify the reading frame of your potential exon. Exons begin at the splice acceptor site (typically AG as the last two positions of the preceding intron) and end at the splice donor site (typically GT as the first two positions of the next intron). If you use genomic data as your query, you will see the similarity to other proteins fall off at or near these splice junctions. An ORF can be smaller than 300 bp, so a size filter is not recommended.
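As a rough illustration of the splice-site rule above, here is a minimal Python sketch that checks whether a candidate exon is flanked by the typical AG (acceptor) and GT (donor) dinucleotides. The function names and the 0-based, end-exclusive coordinate convention are assumptions made for this example, not part of any existing tool. Since similarity only falls off at or near the junction, in practice you would probably test a few positions around each alignment end rather than a single coordinate.

```python
def flanking_dinucleotides(seq, start, end):
    """Return (upstream, downstream) dinucleotides around seq[start:end].

    Coordinates are 0-based, end-exclusive, on the forward strand
    (an assumed convention). A flank that runs off the contig is None.
    """
    seq = seq.upper()
    upstream = seq[start - 2:start] if start >= 2 else None
    downstream = seq[end:end + 2] if end + 2 <= len(seq) else None
    return upstream, downstream


def has_canonical_splice_sites(seq, start, end):
    """True when the candidate exon is flanked by AG ... GT.

    A missing flank is treated as unknown rather than as a failure,
    because partial exons at a contig edge show no splice site there.
    """
    upstream, downstream = flanking_dinucleotides(seq, start, end)
    acceptor_ok = upstream in (None, "AG")
    donor_ok = downstream in (None, "GT")
    return acceptor_ok and donor_ok


# Toy contig: lowercase = intron, uppercase = exon
contig = "ccccagATGGCCAAGgtaagt"
print(has_canonical_splice_sites(contig, 6, 15))
```

First and last exons of a gene, and non-canonical junctions such as GC donors, would fail this strict test, so it is better used as supporting evidence next to the protein match than as a hard filter.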

You can also keep track of the coordinates of the matching protein. If the match for one exon ends at protein position 212 and the match for the next begins at position 243, there is likely a missing exon encoding the 30 amino acids in between (positions 213 to 242), i.e. about 90 bp (or 89 or 91).
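The coordinate arithmetic can be sketched in a few lines of Python. This assumes 1-based, inclusive protein positions, as in the subject start/end columns of BLAST tabular output; the function names are hypothetical.

```python
def missing_residues(prev_end, next_start):
    """Residues of the matched protein left uncovered between two exon hits.

    prev_end and next_start are 1-based, inclusive protein positions.
    Overlapping or adjacent hits give 0.
    """
    return max(0, next_start - prev_end - 1)


def expected_missing_bp(prev_end, next_start):
    """Approximate coding length in bp of the uncovered stretch.

    Returns (low, exact, high). The +/- 1 bp margin simply mirrors the
    "90 (or 89 or 91)" estimate in the text; presumably it allows for a
    codon split across a splice junction (an assumption).
    """
    n = missing_residues(prev_end, next_start)
    if n == 0:
        return (0, 0, 0)
    return (3 * n - 1, 3 * n, 3 * n + 1)


print(missing_residues(212, 243))     # 30
print(expected_missing_bp(212, 243))  # (89, 90, 91)
```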

An ORF that is not really part of a protein will not match well to other fish genes/proteins. Furthermore, since the exons of a given gene match consecutive stretches of the same protein, if an ORF does have a spurious match to a protein, the neighboring ORF will likely not match that same protein at the next positions.
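One way to apply this idea is to keep only the hits whose protein coordinates increase along the contig. The sketch below is a simple greedy filter under stated assumptions: all hits come from one contig, one strand (forward) and one protein, and the `Hit` tuple, the `collinear_hits` name and the `max_overlap` default are hypothetical choices for illustration.

```python
from collections import namedtuple

# q_* = genomic (query) positions, s_* = protein (subject) positions
Hit = namedtuple("Hit", "protein q_start q_end s_start s_end")


def collinear_hits(hits, max_overlap=5):
    """Keep hits whose protein coordinates increase along the contig.

    hits: Hit tuples for ONE contig, ONE strand, ONE protein.
    max_overlap: residues of overlap tolerated between neighbours
    (arbitrary default; alignments can overrun a splice junction).
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.q_start):
        # Accept the hit only if it continues the protein after the
        # previously accepted hit, allowing a small overlap.
        if not kept or hit.s_start > kept[-1].s_end - max_overlap:
            kept.append(hit)
    return kept


hits = [
    Hit("P1", 100, 250, 10, 60),
    Hit("P1", 400, 520, 61, 100),
    Hit("P1", 600, 700, 20, 50),    # goes backwards in the protein
    Hit("P1", 900, 1000, 101, 130),
]
for h in collinear_hits(hits):
    print(h.q_start, h.q_end, h.s_start, h.s_end)
```

Because the filter is greedy, a spurious hit at the start of the contig can cause true hits after it to be rejected. A longest-increasing-subsequence approach, or weighting hits by bit score, would be more robust.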

Lastly, be careful of ORFs encoded by repeat elements. These may encode proteins (say protein elements of an LTR retrotransposon), and should be handled differently.

score 1 · Answer 2 · 2012-05-21

11.9 years ago

JC 13k

Maybe you can extend your ORF predictions with a better system: try Augustus [http://bioinf.uni-greifswald.de/augustus/], trained with your sequences, to predict the gene models in your genomic context.

Thanks! I will definitely take a look at that program and maybe use it in my analysis pipeline. It seems to be quite a "complete" application. I will also share it with other lab members.

Reply written 11.9 years ago by Francois Olivier Hébert ▴ 280

score 0 · Answer 3 · 2012-06-02

To conclude on this topic: it can be extremely complicated to decide which ORFs to keep and which to discard when you have 48 results for the same contig and 30 of them have BLAST hits to some protein in the "nr" database on GenBank. Depending on the quality of your sequences, it can be almost impossible to make the right decision (working with gDNA can be very complex!).

In this case, you can use a nice program that I found, called "asp". You can either download the source code and install it on your machine, or use the Galaxy service by uploading your data (FASTA file) to the website. All the information you need is here:

http://people.tuebingen.mpg.de/vipin/www.fml.tuebingen.mpg.de/raetsch//projects/splice/

The only input you need is your FASTA sequences (genomic DNA). The program identifies the splice sites in your sequences (donor and acceptor). It uses a rigorous algorithm to score each splice site found (AG, GT, GC), and based on this information you can retrieve the exon-intron boundaries. You can specify the "type of organism" you are working with (Fish, Human, Cress, Worm, Fly). Most gene prediction programs need a lot of information to work, and in my case I didn't have everything that was needed, so I used ASP and it worked fine. I can predict the exons in my genes with pretty good confidence (based on the score given in the output file).

Enjoy !