Question

Several exonerate options; which to choose?

7.9 years ago

mforthman ▴ 50

I want to identify exon-intron boundaries by aligning a species' transcripts to its genome with exonerate. For the transcripts, I have both a CDS file and a protein FASTA file. Exonerate offers a number of alignment models, and I'm not sure which is best for my purpose. I've narrowed the list down to four candidates:

est2genome – This model is similar to the affine:local model, but it also includes intron modelling on the target sequence to allow alignment of spliced to unspliced coding sequences for both forward and reversed genes. This is similar to the alignment models used in programs such as EST_GENOME and sim4.

protein2genome – This model allows alignment of a protein sequence to genomic DNA. This is similar to the protein2dna model, with the addition of modelling of introns and intron phases. This model is similar to those used by genewise.

coding2genome – This is similar to the est2genome model, except that the query sequence is translated during comparison, allowing a more sensitive comparison.

cdna2genome – This combines properties of the est2genome and coding2genome models, to allow modeling of whole cDNA where a central coding region can be flanked by non-coding UTRs. When the CDS start and end are known, they may be specified using the --annotation option (see below) to permit only the correct coding region to appear in the alignment.

I don't necessarily see a need for coding2genome, since I already have translated query sequences (i.e., an amino acid FASTA file) and could use protein2genome instead. I'm not sure what the query input should be for est2genome or cdna2genome. Would it be faster or easier, and just as accurate, to query the genome with the CDS file using either of these two models?

EDIT: In case it is useful, I then want to use the exon-intron boundaries reported by exonerate to compare exons from this species' transcripts to transcriptomes of other species, using reciprocal best hits (blastn) or blastx.

exonerate exon-intron boundaries • 3.2k views

updated 7.9 years ago by Giovanni M Dall'Olio 28k • written 7.9 years ago by mforthman ▴ 50

score 0 · Answer 1 · 2016-06-13

est2genome – Definitely not, as it was designed for a technology that is no longer in use (ESTs). Historically, exonerate became well known because it was good at aligning short sequences (ESTs) to a genome, correctly modeling the exon-intron boundaries and handling introns, which appear as very long gaps.
protein2genome – This is more frequently used when you are searching for orthologs in two distantly related species. For example, you have the sequence of a protein in human, and you want to identify where the same gene is encoded in mouse. The protein sequence gets retro-translated to all the possible DNA sequences (remember that some amino acids are encoded by more than one codon), and then aligned. This makes it possible to find very distant orthologs, e.g. when the genomic sequence has changed a lot but the protein sequence has remained the same. From what you explained, I think this is not your case.
coding2genome – This is similar to protein2genome, but starts from the cDNA. As you mentioned, you could use protein2genome directly.
cdna2genome – I think this is the correct one for you, because you are aligning sequences from an organism to its own genome.
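
To make the cdna2genome suggestion concrete, here is a minimal Python sketch that assembles an exonerate command line and builds one line of an --annotation file. It is only an illustration under stated assumptions: the file names (cds.fasta, genome.fasta, cds.annot) and both helper functions are hypothetical, the four-field annotation format (<id> <strand> <cds_start> <cds_length>) is as I recall it from the exonerate man page, and I am assuming cds_start counts from 1. Check all of this against the man page of your installed version. The sketch only prints the command; it does not run exonerate.

```python
import shlex


def cds_annotation_line(seq_id, cds_length, cds_start=1, strand="+"):
    """Build one line of an exonerate --annotation file.

    Assumed format: <id> <strand> <cds_start> <cds_length>.
    For a CDS-only FASTA the whole sequence is coding, so the CDS
    starts at the first base (assumed 1-based) and spans the full length.
    """
    return f"{seq_id} {strand} {cds_start} {cds_length}"


def exonerate_command(model, query, target, annotation=None):
    """Return the exonerate argument list for the given model (not executed)."""
    cmd = [
        "exonerate",
        "--model", model,
        "--query", query,
        "--target", target,
        "--showtargetgff", "yes",   # report exon/intron features as GFF on the genome
        "--showalignment", "no",    # skip the human-readable alignment
        "--bestn", "1",             # keep only the best hit per query
    ]
    if annotation is not None:
        cmd += ["--annotation", annotation]
    return cmd


# Hypothetical file names, for illustration only.
print(cds_annotation_line("transcript_1", 900))
print(shlex.join(exonerate_command("cdna2genome", "cds.fasta", "genome.fasta", "cds.annot")))
print(shlex.join(exonerate_command("protein2genome", "proteins.fasta", "genome.fasta")))
```

The GFF lines produced by --showtargetgff carry the exon and intron coordinates on the genome, which is what you would parse for the downstream exon comparison.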