Question: Several exonerate options; which to choose?
20 months ago by
mforthman (30) wrote:

I want to identify exon-intron boundaries between transcripts and the genome of a given species using exonerate. For the transcripts, I have both a CDS file and a protein FASTA file. Exonerate offers a number of alignment models, and I'm not sure which is best for my purpose. I've narrowed the list down to four candidates:

est2genome – This model is similar to the affine:local model, but it also includes intron modelling on the target sequence to allow alignment of spliced to unspliced coding sequences for both forward and reversed genes. This is similar to the alignment models used in programs such as EST_GENOME and sim4.

protein2genome – This model allows alignment of a protein sequence to genomic DNA. This is similar to the protein2dna model, with the addition of modelling of introns and intron phases. This model is similar to those used by genewise.

coding2genome – This is similar to the est2genome model, except that the query sequence is translated during comparison, allowing a more sensitive comparison.

cdna2genome – This combines properties of the est2genome and coding2genome models, to allow modeling of whole cDNA where a central coding region can be flanked by non-coding UTRs. When the CDS start and end are known, they may be specified using the --annotation option (see below) to permit only the correct coding region to appear in the alignment.

I don't see a need for coding2genome, since I already have the translated query sequences (i.e., an amino-acid FASTA file) and could use protein2genome instead. I'm not sure what the query input should be for est2genome or cdna2genome. Would it be faster or easier, and just as accurate, to query the genome with the CDS file using either of these two models?

EDIT: In case it is useful, I then want to use the exon-intron boundaries reported by exonerate to compare exons from this species' transcripts to the transcriptomes of other species, using reciprocal best hits (blastn) or blastx.
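For the downstream step, one way to pull exon coordinates out of exonerate's GFF-style output might look like the sketch below. It assumes the run was made with GFF reporting on the target enabled and that exon features appear as tab-separated lines with the feature type in the third column. The function name `parse_exons` and the sample records are hypothetical and purely illustrative; they are not real exonerate output.

```python
def parse_exons(gff_text):
    """Collect (sequence, start, end, strand) for every 'exon' feature line.

    Assumes tab-separated GFF-style records; comment lines and any
    non-GFF lines in the report are skipped.
    """
    exons = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF record needs at least 8 columns; column 3 is the feature type.
        if len(fields) < 8 or fields[2] != "exon":
            continue
        exons.append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return exons


# Made-up records for illustration only (not produced by a real run).
sample = "\n".join([
    "# --- START OF GFF DUMP ---",
    "chr1\texonerate:cdna2genome\tgene\t100\t900\t1500\t+\t.\tgene_id 1",
    "chr1\texonerate:cdna2genome\texon\t100\t250\t.\t+\t.\tinsertions 0",
    "chr1\texonerate:cdna2genome\tintron\t251\t699\t.\t+\t.\tintron_id 1",
    "chr1\texonerate:cdna2genome\texon\t700\t900\t.\t+\t.\tinsertions 0",
    "# --- END OF GFF DUMP ---",
])

exons = parse_exons(sample)
for seq, start, end, strand in exons:
    print(seq, start, end, strand)
```

The resulting coordinate list could then be used to cut exon sequences out of the transcripts before running blastn or blastx.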

20 months ago by
London, UK
Giovanni M Dall'Olio (25k) wrote:
  • est2genome: definitely not, as it is tied to a technology that is rarely used anymore (ESTs). Historically, exonerate became popular because it was good at aligning short sequences (ESTs) to a genome, correctly modeling the exon-intron boundaries and handling introns, which behave like very long gaps.

  • protein2genome: this is more frequently used when you are searching for orthologues between two distantly related species. For example, you have the sequence of a human protein and want to find where the same gene is encoded in the mouse genome. The protein is aligned against translations of the genomic DNA rather than against the nucleotides themselves, so synonymous changes do not count against the match (remember that some amino acids are encoded by more than one codon). This makes it possible to find very distant orthologues, e.g. when the genomic sequence has changed a lot but the protein sequence has remained largely the same. From what you explained, I think this is not your case.

  • coding2genome: this is similar to protein2genome, but it starts from the coding nucleotide sequence, which is translated during the comparison. As you mentioned, you could use protein2genome directly.

  • cdna2genome: I think this is the right one for you, because you are aligning sequences from an organism to its own genome. The query here is the nucleotide transcript, so you would use your CDS file.
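As a rough illustration of the recommendation above, a cdna2genome run could be assembled along these lines. This is only a sketch: the file names `transcripts_cds.fasta` and `genome.fasta` and the helper `build_exonerate_command` are hypothetical, and the choice of `--bestn 1` with GFF output on the target is an assumption about what is wanted here, not something stated in the thread.

```python
import shlex


def build_exonerate_command(query_fasta, target_fasta,
                            model="cdna2genome", best_n=1):
    """Return an argument list for an exonerate run that reports GFF on the target."""
    return [
        "exonerate",
        "--model", model,            # alignment model discussed above
        "--query", query_fasta,      # nucleotide transcripts (e.g. the CDS file)
        "--target", target_fasta,    # genome assembly
        "--bestn", str(best_n),      # keep only the top-scoring hit(s) per query
        "--showalignment", "no",     # suppress the human-readable alignment
        "--showvulgar", "no",        # suppress the vulgar summary lines
        "--showtargetgff", "yes",    # emit exon/intron features in genome coordinates
    ]


# Hypothetical file names; replace with your own paths.
cmd = build_exonerate_command("transcripts_cds.fasta", "genome.fasta")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once exonerate is installed. Swapping `model` for `"protein2genome"` with the protein FASTA file as the query would allow a side-by-side comparison of the two approaches.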


Thank you for the clarification! That was very helpful.

Reply written 20 months ago by mforthman (30)