Question: Difference between Ensembl annotation GTF and GFF3 files

When downloading the annotation for a genome from Ensembl, there's a GTF and a GFF3 file available. After reading the README files for these two, I'm having trouble determining whether they contain exactly the same information in different formats, or whether the actual annotations differ between the two. The wording makes it sound like the GFF3 file might include some non-gene features that aren't included in the GTF, and that the two files possibly have different evidence requirements for including a gene. Does anyone know exactly what the differences are?

annotation ensembl genome • 4.7k views
modified 10 months ago by Juke34 • written 5.6 years ago by colin.kern

Did you check this? It will give you a basic idea of what these file types contain.

written 5.6 years ago by venu

That's describing the differences in the formats, which I'm very familiar with. What I'm asking about is whether the Ensembl genome annotations in the two formats contain the exact same gene and feature sets, or if there's a difference in what is included in each. It's a question about Ensembl's data procedures, not the file formats.

written 5.6 years ago by colin.kern

Isn't this answerable from downloading relevant pairs of files and comparing?

written 5.6 years ago by Alex Reynolds

It's not trivial, but doable, to determine whether they're the same. If they're different, I'm not sure how I'd discern what criteria Ensembl used to generate the two.
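
A minimal sketch of such a comparison, checking only whether the two files list the same gene IDs. It assumes the Ensembl conventions that GTF attributes look like `gene_id "..."` and that gene-level GFF3 features carry a `gene_id=...` attribute; both are assumptions worth verifying against your own files. The function names and the toy records (`GENE_A`, `TX_A`, ...) are made up for illustration.

```python
import re

def gtf_gene_ids(lines):
    """Collect gene_id values from GTF lines whose feature type is 'gene'."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            m = re.search(r'gene_id "([^"]+)"', cols[8])
            if m:
                ids.add(m.group(1))
    return ids

def gff3_gene_ids(lines):
    """Collect gene_id values from any GFF3 feature that has a gene_id= attribute."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            m = re.search(r'(?:^|;)gene_id=([^;]+)', cols[8])
            if m:
                ids.add(m.group(1))
    return ids

# Toy records (hypothetical, not real Ensembl data).
gtf_demo = [
    '#!genome-build TOY\n',
    '1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";\n',
    '1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";\n',
    '1\ttoy\tgene\t200\t300\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "lncRNA";\n',
]
gff3_demo = [
    '##gff-version 3\n',
    '1\ttoy\tchromosome\t1\t1000\t.\t.\t.\tID=chromosome:1\n',
    '1\ttoy\tgene\t1\t100\t.\t+\t.\tID=gene:GENE_A;biotype=protein_coding;gene_id=GENE_A\n',
    '1\ttoy\tncRNA_gene\t200\t300\t.\t-\t.\tID=gene:GENE_B;biotype=lncRNA;gene_id=GENE_B\n',
    '1\ttoy\tpseudogene\t400\t500\t.\t+\t.\tID=gene:GENE_C;biotype=pseudogene;gene_id=GENE_C\n',
]

gtf_ids = gtf_gene_ids(gtf_demo)
gff3_ids = gff3_gene_ids(gff3_demo)
only_in_gtf = gtf_ids - gff3_ids
only_in_gff3 = gff3_ids - gtf_ids
print("only in GTF:", sorted(only_in_gtf))
print("only in GFF3:", sorted(only_in_gff3))
```

For real files, pass open file handles instead of the toy lists, e.g. `gtf_gene_ids(open("Homo_sapiens.GRCh38.98.chr.gtf"))`. This only compares gene IDs; it says nothing about coordinates, transcripts, or the criteria Ensembl used.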

written 5.6 years ago by colin.kern

I don't know if this case is the norm or the exception, but while working with the S. cerevisiae annotation I noticed that the GFF3 had entries for whole chromosomes as features, while the GTF did not.

written 10 months ago by almenal990

They do not contain the same feature types (3rd column).
For example, for Homo_sapiens.GRCh38.98.chr.gtf (awk '{print $3}' Homo_sapiens.GRCh38.98.chr.gtf | sort -u):

CDS
Selenocysteine
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

Homo_sapiens.GRCh38.98.chr.gff:

CDS
C_gene_segment
D_gene_segment
GRCh38.p13
J_gene_segment
V_gene_segment
biological_region
chromosome
exon
five_prime_UTR
gene
lnc_RNA
mRNA
miRNA
ncRNA
ncRNA_gene
pseudogene
pseudogenic_transcript
rRNA
scRNA
scaffold
snRNA
snoRNA
tRNA
three_prime_UTR
unconfirmed_transcript
vaultRNA_primary_transcript
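
The two listings above can also be produced with a short script. This is only a sketch: unlike the awk one-liner, it splits on tabs and skips lines starting with `#`, so header lines are not counted as features. The function name `feature_type_counts` and the toy records are made up for illustration.

```python
from collections import Counter

def feature_type_counts(lines):
    """Count the values in the 3rd (feature type) column of GTF/GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts

# Toy GFF3-like records (hypothetical, not real Ensembl data).
demo = [
    "##gff-version 3\n",
    "1\ttoy\tgene\t1\t100\t.\t+\t.\tID=gene:GENE_A\n",
    "1\ttoy\tmRNA\t1\t100\t.\t+\t.\tID=transcript:TX_A;Parent=gene:GENE_A\n",
    "1\ttoy\texon\t1\t50\t.\t+\t.\tParent=transcript:TX_A\n",
    "1\ttoy\texon\t60\t100\t.\t+\t.\tParent=transcript:TX_A\n",
]

counts = feature_type_counts(demo)
for feature_type in sorted(counts):
    print(feature_type, counts[feature_type])
```

Running the same function on both files and comparing the keys of the two counters shows which feature types appear in only one of them.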

In the GTF file you have to inspect the biotype attribute in the 9th column to tell whether a transcript is an mRNA, miRNA, ncRNA, etc., whereas in the GFF this is given directly in the feature type column (3rd column).
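
As a rough sketch of what inspecting the biotype looks like in practice, the snippet below tallies biotypes over the `transcript` lines of a GTF. It assumes the attribute is spelled `transcript_biotype "..."`, as I believe Ensembl GTFs do; check the 9th column of your own file first. The function name and the toy records are made up for illustration.

```python
import re
from collections import Counter

# Assumed attribute key; adjust if your GTF uses a different name.
BIOTYPE_RE = re.compile(r'transcript_biotype "([^"]+)"')

def transcript_biotype_counts(lines):
    """Tally the transcript_biotype attribute over GTF 'transcript' features."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "transcript":
            m = BIOTYPE_RE.search(cols[8])
            counts[m.group(1) if m else "(no transcript_biotype)"] += 1
    return counts

# Toy GTF records (hypothetical, not real Ensembl data).
demo = [
    '#!genome-build TOY\n',
    '1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";\n',
    '1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A"; transcript_biotype "protein_coding";\n',
    '1\ttoy\ttranscript\t200\t300\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B"; transcript_biotype "miRNA";\n',
    '1\ttoy\ttranscript\t400\t500\t.\t+\t.\tgene_id "GENE_C"; transcript_id "TX_C"; transcript_biotype "miRNA";\n',
]

biotypes = transcript_biotype_counts(demo)
for name, n in sorted(biotypes.items()):
    print(name, n)
```

The resulting biotype names can then be compared by eye with the feature types in the GFF listing (mRNA, miRNA, lnc_RNA, ...); the two vocabularies are not guaranteed to match one to one.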

The GTF format is much more constrained when it comes to feature types. See here for more details about the two formats.

modified 10 months ago • written 10 months ago by Juke34