I would like to transfer some gene annotations between a previous and updated assembly (for 1 “chromosome,” derived from what was originally a circular BAC). There will be a mix of large and small changes in the underlying sequence.
I think this is difficult because there are exact and non-exact duplications.
For example, let’s say the annotations on the previous version of the sequence look something like this:
GeneA1-1 GeneB1-1 GeneB2-1 GeneA2-1 GeneB2-2 GeneC GeneA1-2 GeneB1-2 GeneB2-3 GeneA2-2 GeneB2-4
In that example, if at all possible, I would like to keep track of GeneB2-2 being between GeneA2-1 and GeneC (and GeneB2-3 being between GeneB1-2 and GeneA2-2, etc.). The “large changes” include deletions, but not inversions (that I am aware of, at least not where I need to transfer annotations). So, you could imagine that possible changes might include a SNP in GeneA2-2, a deletion of GeneB2-3, etc..
My guess is that using Exonerate can help, but I am wondering if the is some parameter combination that can work better in this situation.
For example, I am already using
--bestn 1 to report 1 hit per query (unless there are ties, which there are for the duplicated genes that are 100% matches), as part of the following command:
/opt/exonerate-2.2.0-x86_64/bin/exonerate --bestn 1 --showtargetgff true --model coding2genome --showalignment no $QUERY $TARGET > $RESULT
I have currently condensed the query file (for example, collapse GeneA1-1 and GeneA1-2 into 1 query sequence called GeneA1). However, genes that are not 100% identical can have the same overlapping Exonerate hits (for example, GeneB1-1 and GeneB2-1 may not be 100% identical, but they may be getting each other’s Exonerate hits). I think this might relate to Exonerate defining a different gene structure than used in the original annotation, and/or genes that were 99% identical in the last assembly became 100% identical in the revised assembly.
Also, if I run Exonerate on the exact same sequence that has the current gene annotations, then they are not all identical. For example, a gene with 4 exons in the current annotations might have 3 exons in the Exonerate hit in roughly the same region (again, on the exact same total genomic sequence, as a positive control).
To describe that previous point in another way, I used gffcompare for the output of Exonerate with the FASTA file for the reduced set of unique query sequences on the original positive control sequence and the original reference annotations. That reports a loss of 22.5% of exons (and addition of 7.5% of “novel” exons). So, if at all possible, I would like to see if this could be improved.
It might be hard to see, but here is a view of the tracks in IGV:
The left hand side has a different class of genes that I am not predicting with Exonerate (and out of the 4 clusters will be deleted in the revised version of the sequence, as an example of a "large" deletion). However, there is at least 1 Exonerate hit that looks too big (about 3/4 of the way through the sequence, on the right side of the image) and I hope that gives some sense of how similar the current annotations are to the Exonerate hits.
I can also indirectly run Exonerate through MAKER. That provides some extra parameters that can be changed. However, at least currently, I think the GffCompare metrics are relatively more similar to the reference with Exonerate being run as described above.
I also figured out how to create a custom .chain file for liftOver (and use CrossMap to convert a GFF/GTF file with annotations), which I mention in this other post. However, I am losing 20-25% of the blocks from the original annotation, so I think I would need to change something. Plus, I think Exonerate is helping in terms of keeping track of the components of the gene annotation (all of the queries have at least 1 hit, but there are duplicates and the hits are not always identical to the original annotation).
Does anybody have any suggestions of what might help (either for Exonerate or something else)?