Question: Difference between Ensembl annotation GTF and GFF3 files
4.9 years ago by
colin.kern920
United States
colin.kern920 wrote:

When downloading the annotation for a genome from Ensembl, there's a GTF and a GFF3 file available. After reading the README files for these two, I'm having trouble determining whether they contain exactly the same information in different formats, or whether the actual annotations differ between the two. The wording makes it sound like the GFF3 file might include some non-gene features that aren't included in the GTF, and that the two files may have different evidence requirements for including a gene. Does anyone know exactly what the differences are?

annotation ensembl genome • 4.1k views
ADD COMMENTlink modified 11 weeks ago by Juke344.4k • written 4.9 years ago by colin.kern920

Did you check this? It will give you a basic idea of what these file types contain.

ADD REPLYlink written 4.9 years ago by venu6.6k

That's describing the differences in the formats, which I'm very familiar with. What I'm asking about is whether the Ensembl genome annotations in the two formats contain the exact same gene and feature sets, or if there's a difference in what is included in each. It's a question about Ensembl's data procedures, not the file formats.

ADD REPLYlink written 4.9 years ago by colin.kern920

Isn't this answerable from downloading relevant pairs of files and comparing?

ADD REPLYlink written 4.9 years ago by Alex Reynolds30k

It's not trivial, but doable, to determine whether they're the same. If they're different, I'm not sure how I'd discern what criteria Ensembl used to generate the two.

ADD REPLYlink written 4.9 years ago by colin.kern920

I don't know if this case is the norm or the exception, but while working with the S. cerevisiae annotation I noticed that the GFF3 had entries for whole chromosomes as features, while the GTF did not.

ADD REPLYlink written 11 weeks ago by almenal990
11 weeks ago by
Juke344.4k
Sweden
Juke344.4k wrote:

They do not contain the same feature types (3rd column).
E.g. in Homo_sapiens.GRCh38.98.chr.gtf (awk '{print $3}' Homo_sapiens.GRCh38.98.chr.gtf | sort -u):

CDS
Selenocysteine
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

Homo_sapiens.GRCh38.98.chr.gff:

CDS
C_gene_segment
D_gene_segment
GRCh38.p13
J_gene_segment
V_gene_segment
biological_region
chromosome
exon
five_prime_UTR
gene
lnc_RNA
mRNA
miRNA
ncRNA
ncRNA_gene
pseudogene
pseudogenic_transcript
rRNA
scRNA
scaffold
snRNA
snoRNA
tRNA
three_prime_UTR
unconfirmed_transcript
vaultRNA_primary_transcript
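
If you want to reproduce this kind of listing and diff the two files directly, here is a rough sketch. The function names `feature_types` and `only_in_first` are just made up for illustration. It assumes tab-separated files with 9 columns and skips every line starting with `#`, so header values do not end up in the list (the GRCh38.p13 entry above looks like it came from a header line rather than from a real feature).

```shell
# List the distinct feature types (3rd column) of a GTF or GFF3 file.
# Lines starting with '#' (comments, directives) are skipped and fields
# are split on tabs only.
feature_types() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $3 }' "$1" | LC_ALL=C sort -u
}

# Print the feature types found in the first file but not in the second.
# The comparison is case-sensitive, so five_prime_UTR and five_prime_utr
# are reported as different types.
only_in_first() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    feature_types "$1" > "$tmp_a"
    feature_types "$2" > "$tmp_b"
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example usage (file names as in the listings above):
#   only_in_first Homo_sapiens.GRCh38.98.chr.gff Homo_sapiens.GRCh38.98.chr.gtf
```

Keep in mind this only compares the labels in column 3. It says nothing about whether the same genes and transcripts are present in both files.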

In the GTF, you have to look at the biotype attribute in the 9th column to find out whether a transcript codes for an mRNA, miRNA, ncRNA, etc., whereas in the GFF this is given directly by the feature type (3rd column).
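
As a minimal sketch of that, the following counts transcripts per biotype in a GTF. It assumes the attribute is spelled `transcript_biotype "value";` in the 9th column, which is how I remember Ensembl GTFs, so check your own file first. The function name `transcript_biotypes` is made up for illustration.

```shell
# Count GTF transcript records per transcript_biotype value.
# Assumes the 9th column carries an attribute written as:
#   transcript_biotype "value";
transcript_biotypes() {
    awk -F'\t' '
        !/^#/ && $3 == "transcript" {
            # Locate the attribute, then strip the key and the quotes.
            if (match($9, /transcript_biotype "[^"]+"/)) {
                value = substr($9, RSTART, RLENGTH)
                sub(/^transcript_biotype "/, "", value)
                sub(/"$/, "", value)
                count[value]++
            }
        }
        END { for (b in count) print b "\t" count[b] }
    ' "$1" | LC_ALL=C sort
}

# Example usage:
#   transcript_biotypes Homo_sapiens.GRCh38.98.chr.gtf
```

The output is one line per biotype (name, tab, count), which you can put side by side with the transcript-level feature types of the GFF.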

The GTF format is much more constrained when it comes to feature types. See here for more details about the two formats.

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Juke344.4k