Question: How To Correctly Recover The Reported Counts Of Known Genes By Parsing Ensembl Gtf File?
7.5 years ago by
Alby90 wrote:

Is it possible to recover the counts of known genes, novel genes, exons, etc. reported on the Ensembl species information page by parsing the GTF file?

For example, the Ensembl page for Mus musculus ( ) shows the following:

Gene counts

Known genes:    21,886
Novel genes:    531
Putative genes:    290
Pseudogenes:    5,482
RNA genes:    7,541
Immunoglobulin/T-cell receptor gene segments:    481
Gene exons:    416,230
Gene transcripts:    97,639

I downloaded the GTF gene annotation file from

The version is identical.

I was able to correctly recover the gene transcript count:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $12}' | tr -d ';"' | sort | uniq | wc -l

But that's all. I couldn't recover any of the other reported counts, such as known genes, novel genes, putative genes, etc.

To count the exons, I tried the following command:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '$3~/exon/ {print $3}' | wc -l

which gives an overestimate compared with the reported 416,230 exons.

I also tried to count the genes with the following:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $10}' | tr -d ';"' | sort | uniq | wc -l

which is also an overestimate. Even if you add up known genes + novel genes + putative genes + pseudogenes + RNA genes + gene segments, you get 36,211, which is different from the number of unique gene IDs I get from parsing.

What am I doing wrong? Thank you in advance

Tags: gtf, ensembl, next-gen, annotation
7.5 years ago by
JC9.3k wrote:

The problem is that you have duplicated information: every annotated feature line carries its feature type (exon, CDS, etc.) together with the gene and transcript IDs, so the same gene, transcript, or exon can appear on many lines. It's better to use a proper parser than to reinvent the wheel. Try Biopython: or any other parser.
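To illustrate the duplication point, here is a minimal sketch in plain Python (not Biopython, and not necessarily how Ensembl computes its page counts). It assumes a tab-separated GTF whose ninth column holds `key "value";` pairs including `gene_id` and `transcript_id`, as in the commands above. The function names `parse_attributes` and `count_features` and the toy `SAMPLE` records are made up for this example. Exons are deduplicated by (chromosome, start, end, strand), which is an assumption about what "unique exon" means; the result may still differ from the reported figure. The sketch does not attempt the known/novel/putative breakdown, since it only uses IDs and coordinates.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn 'gene_id "X"; transcript_id "Y";' into a dict."""
    return dict(ATTR_RE.findall(field))


def count_features(lines):
    """Count unique genes, transcripts and exons in an iterable of GTF lines."""
    genes, transcripts, exons = set(), set(), set()
    exon_lines = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        chrom, feature, start, end, strand = cols[0], cols[2], cols[3], cols[4], cols[6]
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if "transcript_id" in attrs:
            transcripts.add(attrs["transcript_id"])
        if feature == "exon":  # exact match, unlike the regex test $3~/exon/
            exon_lines += 1
            # Assumption: an exon shared by several transcripts is one exon
            exons.add((chrom, int(start), int(end), strand))
    return {
        "genes": len(genes),
        "transcripts": len(transcripts),
        "exon_lines": exon_lines,
        "unique_exons": len(exons),
    }


# Toy records (hypothetical IDs): transcripts T1 and T2 share the exon at 100-200
SAMPLE = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2";',
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; exon_number "1";',
    '1\tprotein_coding\tCDS\t150\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T2"; exon_number "1";',
]

if __name__ == "__main__":
    print(count_features(SAMPLE))
    # For the real file, pass an open file handle instead:
    # with open("Mus_musculus.NCBIM37.67.gtf") as fh:
    #     print(count_features(fh))
```

On the toy records there are three exon lines but only two distinct exon intervals, which is the kind of gap between line counts and feature counts described above.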
