Question: Several exonerate options; which to choose?
4.1 years ago, mforthman wrote:

I want to identify exon-intron boundaries by aligning transcripts to the genome of a given species using exonerate. For the transcripts, I have both a CDS file and a protein FASTA file. I noticed a number of alignment models in exonerate, but honestly I'm not sure which is best for my purpose. I've narrowed the list down to four candidate models:

est2genome – This model is similar to the affine:local model, but it also includes intron modelling on the target sequence to allow alignment of spliced to unspliced coding sequences for both forward and reversed genes. This is similar to the alignment models used in programs such as EST_GENOME and sim4.

protein2genome – This model allows alignment of a protein sequence to genomic DNA. This is similar to the protein2dna model, with the addition of modelling of introns and intron phases. This model is similar to those used by genewise.

coding2genome – This is similar to the est2genome model, except that the query sequence is translated during comparison, allowing a more sensitive comparison.

cdna2genome – This combines properties of the est2genome and coding2genome models, to allow modeling of whole cDNA where a central coding region can be flanked by non-coding UTRs. When the CDS start and end are known, they may be specified using the --annotation option (see below) to permit only the correct coding region to appear in the alignment.
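To make the --annotation option mentioned above more concrete: the exonerate man page describes the annotation file as one line per query with four fields, `<id> <strand> <cds_start> <cds_length>`. The sketch below builds such lines for a CDS-only FASTA file, where the whole sequence is coding. It assumes that `cds_start` is 1-based and that every CDS is on the forward strand; both are assumptions to check against the man page, and the sequence names and bases are made up for illustration.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence id: sequence length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def annotation_lines(lengths: Dict[str, int]) -> List[str]:
    """Build one '<id> <strand> <cds_start> <cds_length>' line per sequence.

    Assumption: each query is CDS-only on the forward strand, so the CDS
    starts at position 1 (assumed 1-based) and spans the whole sequence.
    """
    return [f"{seq_id} + 1 {length}" for seq_id, length in lengths.items()]


# Made-up example records, not real transcripts.
example_fasta = [
    ">tx1 hypothetical transcript",
    "ATGGCCAAG",
    "TTTTAA",
    ">tx2",
    "ATGTGA",
]
example_annotation = annotation_lines(read_fasta_lengths(example_fasta))
print("\n".join(example_annotation))
```

For full-length cDNA with UTRs, the start and length would instead have to come from a real CDS annotation rather than from the sequence length.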

I don't necessarily see a need for coding2genome since I already have translated query sequences (i.e., an amino acid FASTA file), in which case I could use protein2genome. I'm not sure what the query input should be for est2genome or cdna2genome, but would it be faster/easier and just as accurate to use the CDS file to query against the genome with either of these two models?

EDIT: Just in case it is useful information, I then want to use the exon-intron boundary data output by exonerate to compare exons from this species' transcripts to transcriptomes of other species with reciprocal best hits (blastn) or using blastx.
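As a rough illustration of pulling exon-intron boundaries out of exonerate output, the sketch below reads GFF lines such as those produced with `--showtargetgff yes` and keeps the `exon` features. The sample lines are made up for illustration, not real exonerate output, and the intron helper assumes all exons belong to a single alignment on one target sequence.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Exon(NamedTuple):
    target: str
    start: int  # 1-based, inclusive (GFF convention)
    end: int    # 1-based, inclusive
    strand: str


def parse_exons(gff_lines: Iterable[str]) -> List[Exon]:
    """Collect exon features from tab-separated GFF lines."""
    exons = []
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comments
        fields = line.split("\t")
        # Non-GFF lines in the output have too few tab-separated columns.
        if len(fields) < 8 or fields[2] != "exon":
            continue
        exons.append(Exon(fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return exons


def intron_intervals(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Gaps between consecutive exons, assuming one alignment on one target."""
    ordered = sorted(exons, key=lambda e: e.start)
    return [(a.end + 1, b.start - 1) for a, b in zip(ordered, ordered[1:])]


# Made-up GFF lines for illustration only.
example_gff = [
    "# --- START OF GFF DUMP ---",
    "chr1\texonerate:cdna2genome\tgene\t1001\t2500\t900\t+\t.\tgene_id 1 ; sequence tx1",
    "chr1\texonerate:cdna2genome\texon\t1001\t1200\t.\t+\t.\tinsertions 0 ; deletions 0",
    "chr1\texonerate:cdna2genome\tintron\t1201\t1999\t.\t+\t.\tintron_id 1",
    "chr1\texonerate:cdna2genome\texon\t2000\t2500\t.\t+\t.\tinsertions 0 ; deletions 0",
]
example_exons = parse_exons(example_gff)
example_introns = intron_intervals(example_exons)
print(example_exons)
print(example_introns)
```

The exon coordinates on the genome could then be mapped back to transcript positions to cut the transcripts into per-exon sequences for the blastn or blastx comparison.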

4.1 years ago, Giovanni M Dall'Olio (London, UK) wrote:
  • est2genome: definitely not, as it was designed for a technology that is no longer in common use (ESTs). Historically, exonerate became popular because it was good at aligning short sequences (ESTs) to a genome, correctly modeling exon-intron boundaries and handling introns, which behave as very long gaps.

  • protein2genome: this is more frequently used when you are searching for orthologs between two distantly related species. For example, you have the sequence of a human protein and you want to identify where the same gene is encoded in mouse. The protein sequence gets back-translated to all the possible DNA sequences (remember that some amino acids are encoded by more than one codon) and then aligned. This makes it possible to find very distant orthologs, e.g. when the genomic sequence has changed a lot but the protein sequence has remained the same. From what you explained, I think this is not your case.

  • coding2genome: this is similar to protein2genome, but starts from the cDNA. As you mentioned, you could use protein2genome directly.

  • cdna2genome: I think this is the correct one for you, because you are aligning sequences from an organism to its own genome.
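A minimal sketch of how the recommended model could be invoked, written as a Python helper that only assembles the argument list. The file names are hypothetical placeholders; the flags used (`--model`, `--query`, `--target`, `--bestn`, `--showalignment`, `--showvulgar`, `--showtargetgff`, `--annotation`) are standard exonerate options, but the particular values are just one reasonable choice, not a recommendation from the thread.

```python
import shlex
from typing import List, Optional


def exonerate_command(
    query: str,
    target: str,
    model: str = "cdna2genome",
    annotation: Optional[str] = None,
    bestn: int = 1,
) -> List[str]:
    """Assemble an exonerate argument list that reports GFF on the target."""
    cmd = [
        "exonerate",
        "--model", model,
        "--query", query,
        "--target", target,
        "--bestn", str(bestn),     # keep only the best hit(s) per query
        "--showalignment", "no",   # suppress the human-readable alignment
        "--showvulgar", "no",      # suppress the vulgar summary lines
        "--showtargetgff", "yes",  # emit exon/intron features as GFF
    ]
    if annotation is not None:
        # Optional CDS annotation file for the query sequences.
        cmd += ["--annotation", annotation]
    return cmd


# Hypothetical file names, for illustration only.
example_cmd = exonerate_command(
    "cds.fasta", "genome.fasta", annotation="cds.annotation"
)
print(shlex.join(example_cmd))

# The same helper with the protein file and protein2genome, for comparison.
protein_cmd = exonerate_command(
    "proteins.fasta", "genome.fasta", model="protein2genome"
)
print(shlex.join(protein_cmd))
```

With exonerate installed, the list can be passed to `subprocess.run(example_cmd, check=True)`; building it separately makes it easy to try both models on the same genome and compare the reported boundaries.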


Thank you for the clarification! That was very helpful.

Reply written 4.1 years ago by mforthman