How to correctly recover the reported counts of known genes by parsing an Ensembl GTF file?
11.8 years ago
Alby ▴ 90

I was wondering whether it is possible to recover the counts of known genes, novel genes, exons, etc. reported on the Ensembl species information page by parsing the GTF file.

For example, for Mus musculus, the Ensembl page (http://useast.ensembl.org/Mus_musculus/Info/StatsTable?db=core) shows the following:

Gene counts

Known genes:    21,886
Novel genes:    531
Putative genes:    290
Pseudogenes:    5,482
RNA genes:    7,541
Immunoglobulin/T-cell receptor gene segments:    481
Gene exons:    416,230
Gene transcripts:    97,639

I downloaded the gene annotation GTF file from ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/

The release version is identical to the one shown on the page.

I was able to correctly recover the number of gene transcripts:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $12}' | tr -d ';"' | sort | uniq | wc -l
97639

But that is all. For the other counts (known genes, novel genes, putative genes, etc.) I could not recover the reported numbers.

To count the exons, I tried the following command:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '$3~/exon/ {print $3}' | wc -l
 689492

which is an overestimate.

I also tried to count the genes with the following command:

$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $10}' | tr -d ';"' | sort | uniq | wc -l
37991

which is also an overestimate. Even if you add up known genes + novel genes + putative genes + pseudogenes + RNA genes + gene segments, you get 36,211, which still differs from the parsed result.

What am I doing wrong? Thank you in advance.

gtf ensembl annotation next-gen • 2.2k views
11.8 years ago
JC 13k

The problem is that the file contains duplicate information: each annotated feature line carries its feature type (exon, CDS, etc.) together with the gene and transcript IDs, so the same gene, and any exon shared by several transcripts, appears on many lines. It's better to use a proper parser than to reinvent the wheel. Try Biopython: http://biopython.org/wiki/GFF_Parsing or any other parser.
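If you only need counts, here is a minimal sketch of the de-duplication idea using only the Python standard library. It is not the Biopython parser and not necessarily how Ensembl computes its statistics. The names `parse_attributes` and `summarize_gtf` are made up for illustration. It reads attributes by key rather than by awk column position, and it assumes an exon can be identified by its coordinates (chromosome, start, end, strand); if your file carries an exon ID attribute, that would probably be a better key. It also tallies distinct genes per value of column 2, on the assumption that this column holds a biotype-like label in this release.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the attribute column (column 9) of a GTF line.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn 'gene_id "G1"; transcript_id "T1";' into a dict keyed by attribute name."""
    return dict(ATTR_RE.findall(field))


def summarize_gtf(lines):
    """Count distinct genes, transcripts and exons in an iterable of GTF lines."""
    genes, transcripts, exons = set(), set(), set()
    exon_lines = 0
    genes_by_source = defaultdict(set)  # column 2 value -> gene IDs seen with it

    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        chrom, source, feature = cols[0], cols[1], cols[2]
        start, end, strand = cols[3], cols[4], cols[6]
        attrs = parse_attributes(cols[8])

        gene_id = attrs.get("gene_id")
        if gene_id:
            genes.add(gene_id)
            genes_by_source[source].add(gene_id)
        transcript_id = attrs.get("transcript_id")
        if transcript_id:
            transcripts.add(transcript_id)

        if feature == "exon":
            exon_lines += 1
            # Assumption: an exon is identified by its coordinates, so an exon
            # shared by several transcripts is counted only once.
            exons.add((chrom, start, end, strand))

    return {
        "genes": len(genes),
        "transcripts": len(transcripts),
        "exon_lines": exon_lines,
        "unique_exons": len(exons),
        "genes_by_source": {k: len(v) for k, v in genes_by_source.items()},
    }
```

To try it on the real file, open `Mus_musculus.NCBIM37.67.gtf` and pass the file handle to `summarize_gtf`. Comparing `exon_lines` with `unique_exons` shows how much of the exon overcount comes from exons repeated across transcripts. I have not checked whether the resulting numbers match the Ensembl stats page exactly.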
