Question: Improving reference gtf annotation via merging with alternative gtf annotations - how?
6 weeks ago by epaminonda (United Kingdom) wrote:


I am looking at the Ensembl v100 GTF annotation file for a model animal of interest and found that the annotation lacks a number of important genes, most crucially human homologues. Interestingly, alongside the reference annotation for this species, a few alternative annotations are available in GTF format.

I can find some of the missing genes in the alternative annotations and was wondering what the best way is to merge the GTFs to create an 'improved' GTF annotation for the species. I am not particularly concerned about differences in exons, intron chains and transcripts, but rather about the completeness of the protein-coding gene annotation landscape.

I would imagine one way of doing this is manually, via a script that uses unique gene signatures as hash keys. Is there anything out there that might do this better?
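
A minimal sketch of the hash-key approach described above, in Python. It assumes GTF input with a `gene_id` attribute and, where available, a `gene_name` attribute, and it treats the upper-cased gene name as the 'unique gene signature'. That choice, and the function names `parse_attrs`, `gene_key`, `read_genes` and `merge_gtf`, are illustrative assumptions rather than an existing tool. Genes from the supplementary file are added only if their signature is absent from the reference.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attrs(field):
    """Parse the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def gene_key(attrs):
    """Hash key for a gene: gene_name if present, otherwise gene_id (assumption)."""
    return attrs.get("gene_name", attrs.get("gene_id", "")).upper()


def read_genes(lines):
    """Group GTF lines by gene_id, keeping file order within each gene."""
    genes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        gid = attrs.get("gene_id")
        if gid is None:
            continue
        entry = genes.setdefault(gid, {"key": None, "lines": []})
        # Prefer the signature taken from the 'gene' line itself.
        if cols[2] == "gene" or entry["key"] is None:
            entry["key"] = gene_key(attrs)
        entry["lines"].append(line)
    return genes


def merge_gtf(reference_lines, supplementary_lines):
    """Return all reference lines plus supplementary genes not seen in the reference."""
    ref = read_genes(reference_lines)
    seen = {g["key"] for g in ref.values()}
    merged = [l for g in ref.values() for l in g["lines"]]
    for g in read_genes(supplementary_lines).values():
        if g["key"] not in seen:
            seen.add(g["key"])
            merged.extend(g["lines"])
    return merged
```

Matching on gene name alone misses genes that are named differently in the two annotations and ignores coordinates entirely, so this is a starting point only.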

So far, I've only found 'StringTie' and 'GffCompare'. Both appear unsuitable for the task, because they discard all gene-level information in the input and only retain 'exon' and 'transcript' info.

Thanks in advance!

annotation gene merge gff3 gtf
6 weeks ago by Juke34 wrote:

You can try from AGAT


Hi, thanks so much for this - I've installed it and run it and it seems it does exactly what I'm after.

I was just wondering whether you're one of the developers or maintainers, and whether I could ask you a few follow-up questions?

The main one is this: the tool seems to return a GFF3 containing the union of the features from the inputs. So if, say, gene FOO1 is in input1_reference.gff but not in input2.gff, output.gff will contain exactly one 'gene' line referring to gene FOO1. However, if both input1_reference.gff and input2.gff contain an entry for gene FOO1, output.gff seems to contain two lines, e.g.:

X      ensembl gene    23629507        23772048        .       -       .       ID=ENS01;gene_biotype=protein_coding;gene_id=ENS01;gene_name=FOO1;gene_source=ensembl;gene_version=1
X      ensembl gene    23711780        23871846        .       -       .       ID=ENS02;gene_biotype=protein_coding;gene_id=ENS02;gene_name=FOO1;gene_source=ensembl;gene_version=5

In situations like the one above, is there a way to get AGAT to retain only the entry from the 'reference' input (as I assume the reference is in general better curated than the other annotations)?

I guess my question boils down to - is there a way to specify a 'dominant' input annotation and a few supplementary ones?
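
To make the request concrete, here is a rough post-processing sketch in Python (not an AGAT feature) of what a 'dominant' input could mean for the merged GFF3. It assumes the gene IDs of the reference are known in advance (for instance collected from input1_reference.gff), that duplicates are recognised by a shared `gene_name` attribute, and that child features point to their parent via `Parent=`. The function names `gff3_attrs` and `prefer_reference` are made up for illustration.

```python
def gff3_attrs(field):
    """Parse the 9th GFF3 column (key=value;key=value) into a dict."""
    return dict(p.split("=", 1) for p in field.split(";") if "=" in p)


def prefer_reference(lines, reference_ids):
    """Drop non-reference genes (and their descendants) whose gene_name
    also occurs on a gene whose ID is in reference_ids."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            rows.append((line, None, None))  # comments etc. are kept untouched
        else:
            rows.append((line, cols[2], gff3_attrs(cols[8])))

    # Gene names that the reference annotation already provides.
    names_in_ref = {
        a["gene_name"]
        for _, t, a in rows
        if t == "gene" and a.get("ID") in reference_ids and "gene_name" in a
    }

    # Non-reference gene IDs that duplicate one of those names.
    dropped = set()
    for _, ftype, attrs in rows:
        if (ftype == "gene" and "ID" in attrs
                and attrs["ID"] not in reference_ids
                and attrs.get("gene_name") in names_in_ref):
            dropped.add(attrs["ID"])

    # Propagate to descendants (transcripts, exons, CDS) via Parent links.
    changed = True
    while changed:
        changed = False
        for _, _, attrs in rows:
            if attrs is None or "ID" not in attrs or attrs["ID"] in dropped:
                continue
            if any(p in dropped for p in attrs.get("Parent", "").split(",")):
                dropped.add(attrs["ID"])
                changed = True

    kept = []
    for line, _, attrs in rows:
        if attrs is not None:
            parents = attrs.get("Parent", "").split(",")
            if attrs.get("ID") in dropped or any(p in dropped for p in parents):
                continue
        kept.append(line)
    return kept
```

With the two FOO1 lines above and `reference_ids = {"ENS01"}`, only the ENS01 entry would remain.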


written 6 weeks ago by epaminonda

When two loci overlap in their CDS (or in their exons if there is no CDS) and are of the same level2 type (e.g. mRNA, tRNA, ncRNA), they are merged into one locus. If the mRNA is identical it is discarded; if not, it is kept as an isoform with all its sub-features. The top feature retained (here the gene) will be the one from the first file, I guess... but I should check the code to be sure.
EDIT - I'm sorry, but the top feature is picked at random. This is something that could be improved. Please open an issue on GitHub; I might look at it in the future.
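
For readers who want the merge rule spelled out, here is a small Python sketch of the decision as described above. It is not AGAT's code: the dictionary layout (`seqid`, `level2`, `cds`, `exon`) and the names `intervals_overlap` and `should_merge` are assumptions made for illustration, and strand is ignored because it is not mentioned in the description.

```python
def intervals_overlap(a, b):
    """True if any (start, end) in a overlaps any in b (1-based, inclusive)."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def should_merge(locus_a, locus_b):
    """Apply the rule: same level2 type and overlapping CDS
    (falling back to exons for a locus that has no CDS)."""
    if locus_a["seqid"] != locus_b["seqid"]:
        return False
    if locus_a["level2"] != locus_b["level2"]:
        return False
    parts_a = locus_a["cds"] or locus_a["exon"]
    parts_b = locus_b["cds"] or locus_b["exon"]
    return intervals_overlap(parts_a, parts_b)


# Two hypothetical mRNA loci whose gene spans overlap but whose CDS do not.
a = {"seqid": "X", "level2": "mRNA", "cds": [(100, 200), (300, 400)], "exon": [(90, 200), (300, 450)]}
b = {"seqid": "X", "level2": "mRNA", "cds": [(210, 290)], "exon": [(205, 295)]}
print(should_merge(a, b))  # False: the loci stay separate
```

Under this rule, two genes can overlap in their overall coordinates and still be kept as separate loci, as long as their CDS (or exons) do not overlap.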

written 6 weeks ago by Juke34