Improving reference gtf annotation via merging with alternative gtf annotations - how?
2.2 years ago
epaminonda ▴ 10


I am looking at the Ensembl v100 GTF annotation file for a model animal of interest and found that the annotation lacks a number of important genes, most crucially human homologues. Interestingly, a few alternative annotations in GTF format are available alongside the reference annotation for this species.

I can find some of the missing genes in the alternative annotations and was wondering what the best way is to merge the GTFs to create an 'improved' GTF annotation for the species. I am not particularly fussed about differences in exons, intron chains and transcripts; I care more about the completeness of the protein-coding gene annotation landscape.

I would imagine one way of doing this is manually, via a script using unique gene signatures as hash keys. Is there anything out there that might do this better?
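For what it's worth, a minimal sketch of that manual approach might look like the following. It assumes that `gene_name` (falling back to `gene_id`) is a good enough 'unique gene signature', and that it is present on every feature line of both files. The function names are hypothetical, not taken from any existing tool.

```python
import re

# GTF attributes look like: gene_id "ENS01"; gene_name "FOO1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attrs(field):
    """Turn the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def gene_key(line):
    """Return the hash key for a GTF line, or None for comments/odd lines.

    Assumption: gene_name is the signature; gene_id is used if it is absent.
    """
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    attrs = parse_gtf_attrs(cols[8])
    return attrs.get("gene_name") or attrs.get("gene_id")


def merge_gtf_lines(ref_lines, alt_lines):
    """Keep every reference line; add alternative lines only for genes
    whose key never occurs in the reference."""
    ref_keys = set()
    merged = []
    for line in ref_lines:
        merged.append(line)
        key = gene_key(line)
        if key is not None:
            ref_keys.add(key)
    for line in alt_lines:
        key = gene_key(line)
        if key is not None and key not in ref_keys:
            merged.append(line)
    return merged
```

The output is not re-sorted by coordinate and overlapping loci with different names are not reconciled, so this only covers the 'gene is missing from the reference' case.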

So far, I've only found 'StringTie' and 'GffCompare'. Both appear to be unsuitable for the task, because they discard all gene-level information in the input and retain only 'exon' and 'transcript' info.

Thanks in advance!

gtf gff3 annotation gene merge • 953 views
2.2 years ago
Juke34 ★ 7.2k

You can try from AGAT


Hi, thanks so much for this - I've installed it and run it, and it seems to do exactly what I'm after.

I was just wondering whether you're one of the developers or maintainers, and whether I could ask you a few follow-up questions?

The main one: the tool seems to return a GFF3 containing the union of the features from the inputs. So if, say, gene FOO1 is in input1_reference.gff but not in input2.gff, output.gff will contain exactly one 'gene' line for FOO1. However, if both input1_reference.gff and input2.gff contain an entry for gene FOO1, output.gff seems to contain two lines, e.g.:

X      ensembl gene    23629507        23772048        .       -       .       ID=ENS01;gene_biotype=protein_coding;gene_id=ENS01;gene_name=FOO1;gene_source=ensembl;gene_version=1
X      ensembl gene    23711780        23871846        .       -       .       ID=ENS02;gene_biotype=protein_coding;gene_id=ENS02;gene_name=FOO1;gene_source=ensembl;gene_version=5

In situations like the one above, is there a way to get AGAT to retain only the entry from the 'reference' input (I assume the reference is in general better curated than the other annotations)?

I guess my question boils down to this: is there a way to specify a 'dominant' input annotation and a few supplementary ones?



When two loci overlap in their CDS (or in their exons if there is no CDS) and they are of the same level2 type (e.g. mRNA, tRNA, ncRNA), they are merged into one locus. If the mRNA is identical, it is discarded; if not, it is kept as an isoform with all its sub-features. The top feature retained (here the gene) will be the one from the first file, I guess... but I should investigate the code to be sure.
EDIT - I'm sorry, but the top feature is picked randomly. This is something that might be improved. Please open an issue on GitHub. I might look at it in the future.
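In the meantime, a possible workaround for the duplicated `gene_name` case shown above is to post-filter the merged GFF3 outside AGAT. The sketch below is not part of AGAT and `drop_duplicate_genes` is a hypothetical name. It assumes that you have collected the gene IDs of the reference file yourself, that gene lines carry `ID` and `gene_name` attributes as in the example, and that parents appear before their children in the file.

```python
def parse_gff3_attrs(field):
    """Turn the 9th GFF3 column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_duplicate_genes(merged_lines, reference_gene_ids):
    """Drop non-reference genes (and their sub-features) whose gene_name
    is already carried by a reference gene.

    Assumption: every parent feature precedes its children in merged_lines.
    """
    records = []
    for line in merged_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            records.append((line, None, None))  # comment or malformed line
        else:
            records.append((line, cols[2], parse_gff3_attrs(cols[8])))

    # Gene names that occur on at least one reference gene.
    ref_names = {
        attrs.get("gene_name")
        for _, ftype, attrs in records
        if ftype == "gene" and attrs.get("ID") in reference_gene_ids
    }
    ref_names.discard(None)

    dropped_ids = set()
    kept = []
    for line, ftype, attrs in records:
        if attrs is None:
            kept.append(line)
            continue
        feature_id = attrs.get("ID")
        if ftype == "gene":
            is_duplicate = (
                feature_id not in reference_gene_ids
                and attrs.get("gene_name") in ref_names
            )
            if is_duplicate:
                if feature_id:
                    dropped_ids.add(feature_id)
                continue
        elif any(p in dropped_ids for p in attrs.get("Parent", "").split(",")):
            # Child of a dropped feature: drop it and remember its own ID.
            if feature_id:
                dropped_ids.add(feature_id)
            continue
        kept.append(line)
    return kept
```

With the two FOO1 lines above and `reference_gene_ids = {"ENS01"}`, this would keep ENS01 and remove ENS02 together with anything whose `Parent` chain leads back to ENS02. It matches on `gene_name` only, so it does not touch loci that AGAT has already merged into one.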

