Question

Selecting primary transcript for each locus from a GFF file

0

8.6 years ago

arnstrm ★ 1.8k

Hello,

I am planning to use Maker-predicted genes to identify orthologs among closely related species. However, Maker has predicted multiple transcripts for each locus (because several gene predictors were used within Maker, and because the genes have multiple isoforms). Although I am keeping only predictions with AED scores <1.0, I still have many models for each locus. My question is: what is the best way to choose a transcript for a region? Should I select the longest coding sequence for that region? Is there a program that can perform this step?

Thanks for any help!

orthologs annotations gff predictions • 3.7k views

updated 19 months ago by Ram 43k • written 8.6 years ago by arnstrm ★ 1.8k

1

8.6 years ago

Joseph Pearson ▴ 480

Many genes will have multiple splice variants with identical CDSs, so picking the longest CDS might not be sufficient. You could use the most strongly expressed RNA (sort by gene, then by expression level, and keep the top transcript per gene using awk or Excel), but the most strongly expressed transcript will frequently differ between tissues. In summary, there's a good reason why multiple transcripts exist; there is not one "best" transcript. That being said, most transcript variants will be substantially similar, so if you arbitrarily choose among mRNAs with similar evidence (functional genomics/transcriptomic data), you will be able to identify orthologs from the common regions of each transcript.
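
As a rough illustration of the "sort by gene, then by expression, keep the best" idea, here is a minimal Python sketch. It assumes a GFF3 file in which `mRNA` features carry `ID` and `Parent` attributes and `CDS` features point to their mRNA via `Parent`. The function names (`parse_attributes`, `pick_transcripts`), the `expression` dictionary (mRNA ID to some expression value such as TPM), and the demo records are all hypothetical. Ties in expression are broken by total CDS length and then by ID, which is an arbitrary but reproducible choice, not an established standard.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def pick_transcripts(gff_lines, expression=None):
    """Return {gene_id: chosen_mRNA_id} for GFF3 lines.

    Ranking per gene: highest expression first (0.0 when unknown),
    then longest summed CDS, then mRNA ID for a deterministic result.
    """
    expression = expression or {}
    gene_of = {}                 # mRNA ID -> gene ID
    cds_len = defaultdict(int)   # mRNA ID -> summed CDS length in bp

    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA" and "ID" in attrs and "Parent" in attrs:
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS":
            # A CDS may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    cds_len[parent] += end - start + 1  # GFF3 is 1-based, inclusive

    by_gene = defaultdict(list)
    for mrna, gene in gene_of.items():
        by_gene[gene].append(mrna)

    return {
        gene: min(mrnas, key=lambda m: (-expression.get(m, 0.0), -cds_len[m], m))
        for gene, mrnas in by_gene.items()
    }


# Hypothetical two-isoform locus, for illustration only.
demo = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\tmaker\tmRNA\t1\t1000\t.\t+\t.\tID=g1-RA;Parent=g1",
    "chr1\tmaker\tCDS\t1\t300\t.\t+\t0\tID=g1-RA:cds;Parent=g1-RA",
    "chr1\tmaker\tmRNA\t1\t1000\t.\t+\t.\tID=g1-RB;Parent=g1",
    "chr1\tmaker\tCDS\t1\t300\t.\t+\t0\tID=g1-RB:cds;Parent=g1-RB",
    "chr1\tmaker\tCDS\t501\t800\t.\t+\t0\tID=g1-RB:cds;Parent=g1-RB",
]

print(pick_transcripts(demo))
print(pick_transcripts(demo, {"g1-RA": 12.0, "g1-RB": 3.5}))
```

Without expression values the demo falls back to the longer CDS (`g1-RB`); with them, the more strongly expressed isoform (`g1-RA`) wins. Because the preferred isoform can change between tissues, the result depends on which expression data set is supplied.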

updated 19 months ago by Ram 43k • written 8.6 years ago by Joseph Pearson ▴ 480

Ram · Accepted Answer · 2015-09-22

You could use EVidenceModeler (EVM) to get a consensus prediction.

Another approach is to cluster orthologs using all predicted genes, then prune the clusters using some criterion (the longest transcript is not necessarily the best). The Agalma pipeline uses this latter approach, though the paper does not detail how this is performed (and Agalma is designed primarily for RNA-seq data sets).
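
To make the cluster-then-prune idea concrete, here is a minimal Python sketch. It is not how Agalma does it (the paper does not say); it only shows one possible pruning rule. It assumes you already have each ortholog cluster as a list of transcript IDs, a transcript-to-gene mapping, and a per-transcript score where lower is better, for example the AED value reported by Maker (0 means full agreement with the evidence). The function name `prune_cluster` and all IDs and scores in the demo are hypothetical.

```python
def prune_cluster(cluster, gene_of, score):
    """Keep one transcript per gene within a single ortholog cluster.

    cluster : iterable of transcript IDs grouped by the clustering step
    gene_of : dict mapping transcript ID -> gene (locus) ID
    score   : dict mapping transcript ID -> numeric score, lower is better;
              transcripts without a score are ranked last
    """
    best = {}  # gene ID -> (score, transcript ID) of the current winner
    for tx in cluster:
        gene = gene_of[tx]
        key = (score.get(tx, float("inf")), tx)  # ID breaks ties reproducibly
        if gene not in best or key < best[gene]:
            best[gene] = key
    return sorted(tx for _, tx in best.values())


# Hypothetical cluster with two isoforms from species A and one from species B.
cluster = ["spA_g1-RA", "spA_g1-RB", "spB_g7-RA"]
gene_of = {
    "spA_g1-RA": "spA_g1",
    "spA_g1-RB": "spA_g1",
    "spB_g7-RA": "spB_g7",
}
aed = {"spA_g1-RA": 0.30, "spA_g1-RB": 0.12, "spB_g7-RA": 0.25}

print(prune_cluster(cluster, gene_of, aed))
```

In the demo, the two isoforms of `spA_g1` collapse to the one with the lower AED, while the single transcript from species B is kept as is. Any other criterion (expression, CDS length, alignment coverage within the cluster) can be swapped in by changing the `score` dictionary.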