Question: extract only geneID and gene symbol from GTF file
gravatar for catherine12243
5.4 years ago by
United States
catherine12243140 wrote:

Hi Everyone,

I want to get geneID(gene_id) and its associated gene symbol(gene_name) from a gtf file into a file that can be read into R.

my gtf file looks like this:


chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1";

chr1    HAVANA  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1-001"; level 2; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";


Can anyone tell me how to write a little script (preferably  python) to do it?

Any idea would be appreciated!



python gtf genome • 15k views
ADD COMMENTlink modified 2.1 years ago by Min Dai160 • written 5.4 years ago by catherine12243140
gravatar for Alex Reynolds
5.4 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

One quick and dirty way to extract attributes:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    attr = dict(item.strip().split(' ') for item in line.split('\t')[8].strip('\n').split('') if item)
    sys.stdout.write("%s\n" % (attr['gene_id'].strip('\"') + '\t' + attr['gene_name'].strip('\"')))

To use:

$ ./ < foo.gtf > answer.txt

The result with your test input:

$ more answer.txt
​ENSMUSG00000102693.1    RP23-271O17.1
ENSMUSG00000102693.1    RP23-271O17.1

As mentioned, this is quick and dirty and does no error checking or input validation. Consider adding a try..except block to handle the case where an attribute field is missing a value (or the attribute field is missing altogether).

ADD COMMENTlink modified 3.4 years ago • written 5.4 years ago by Alex Reynolds30k
gravatar for jaro.slamecka
3.1 years ago by
NIH National Center for Advancing Translational Sciences
jaro.slamecka90 wrote:

For those who want to use R directly, there is indeed a fast and efficient way to do it, using package "refGenome", first create and empty refGenome object, then read the GTF file in and finally subset the data.frame to only keep gene IDs and symbols

gtf = ensemblGenome()
read.gtf(gtf, filename="Mus_musculus.GRCm38.89.gtf")
genes = gtf@ev$genes[ ,c("gene_id","gene_name")]

Edit: At this point, it is true that refGenome is no longer available, here is a fast alternative:

gtf.file = "/path/to/Homo_sapiens.GRCh38.100.gtf" = rtracklayer::import(gtf.file) # creates a GRanges object
gtf.df =
genes = unique(gtf.df[ ,c("gene_id","gene_name")])
fwrite(genes, file="gene_ID.gene_name.txt", sep="\t")

(line 3 above looks like it's missing the opening bracket on, not sure how to fix it)

ADD COMMENTlink modified 3 months ago • written 3.1 years ago by jaro.slamecka90

This package was removed from CRAN on the 4th March 2019 (

ADD REPLYlink written 19 months ago by James Reeve100

The link shows that this package is still available. I can install this package today (07/24/2019)

ADD REPLYlink written 14 months ago by luongthang19080

How can I extract the "genes" with the subset of data in csv/tab format? thanks

ADD REPLYlink modified 11 months ago • written 11 months ago by kevin939270

In R

write.csv(genes, file="genes.csv", sep=",", quote=FALSE)

Use sep="\t" for tab delimited

ADD REPLYlink written 5 months ago by kausbiowiz0
gravatar for EagleEye
5.4 years ago by
EagleEye6.7k wrote:

This should work,

Ensembl GTF:

zcat Homo_sapiens.GRCh38.94.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[5]"\t"$7}' | sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_biotype "//'| sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed 's/ //g' | sed '1igene_id\tGeneSymbol\tChromosome\tClass\tStrand' > Homo_sapiens.GRCh38.94_gene_annotation_table.txt


gene_id GeneSymbol  Chromosome  Class   Strand
ENSG00000223972 DDX11L1 1:11869-14409   transcribed_unprocessed_pseudogene  +
ENSG00000227232 WASH7P  1:14404-29570   unprocessed_pseudogene  -
ENSG00000278267 MIR6859-1   1:17369-17436   miRNA   -
ENSG00000243485 MIR1302-2HG 1:29554-31109   lincRNA +
ENSG00000284332 MIR1302-2   1:30366-30503   miRNA   +
ENSG00000237613 FAM138A 1:34554-36081   lincRNA -
ENSG00000268020 OR4G4P  1:52473-53312   unprocessed_pseudogene  +
ENSG00000240361 OR4G11P 1:57598-64116   transcribed_unprocessed_pseudogene  +
ENSG00000186092 OR4F5   1:65419-71585   protein_coding  +

Gencode GTF:

zcat gencode.v28.annotation.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' | sed "1i\Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength"  > gencode.v28_gene_annotation_table.txt


Geneid  GeneSymbol  Chromosome  Start   End Class   Strand  Length
ENSG00000223972.5    DDX11L1    chr1    11869   14409    transcribed_unprocessed_pseudogene +   2540
ENSG00000227232.5    WASH7P chr1    14404   29570    unprocessed_pseudogene -   15166
ENSG00000278267.1    MIR6859-1  chr1    17369   17436    miRNA  -   67
ENSG00000243485.5    MIR1302-2HG    chr1    29554   31109    lincRNA    +   1555
ENSG00000284332.1    MIR1302-2  chr1    30366   30503    miRNA  +   137
ENSG00000237613.2    FAM138A    chr1    34554   36081    lincRNA    -   1527
ENSG00000268020.3    OR4G4P chr1    52473   53312    unprocessed_pseudogene +   839
ENSG00000240361.2    OR4G11P    chr1    57598   64116    transcribed_unprocessed_pseudogene +   6518
ENSG00000186092.6    OR4F5  chr1    65419   71585    protein_coding +   6166
ADD COMMENTlink modified 22 months ago • written 5.4 years ago by EagleEye6.7k

What should have worked??

ADD REPLYlink written 23 months ago by SmallChess530
gravatar for Tariq Daouda
5.4 years ago by
Tariq Daouda210
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda210 wrote:


I have written a GTF parser, it is part of pyGeno but you can also download it separately from github. It should make the extraction easier.

from import GTFFile

gtf = GTFFile(gtfFilePath)
for line in gtf :
   print line["gene_id"], line["gene_name"]



ADD COMMENTlink written 5.4 years ago by Tariq Daouda210
gravatar for prasundutta87
2.9 years ago by
prasundutta87380 wrote:

best tool for this..

ADD COMMENTlink written 2.9 years ago by prasundutta87380
gravatar for Min Dai
2.1 years ago by
Min Dai160
Min Dai160 wrote:
awk 'BEGIN{FS="\t";OFS="\t"}{
    print $5-$4+1,$9;
  }}' genes.gtf | tr -d ";\"" | awk 'BEGIN{FS=" ";OFS="\t"}{
      for(i in L){
        print i"\t"L[i];
      }' >genes_transcript_length.txt

I calculated length for both coding and noncoding transcripts.

Refer to:

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Min Dai160
gravatar for Antonio R. Franco
5.4 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.5k wrote:

And why not using R directly.. ?

The gtf file use to be either tabulated or space separated

You can run a read.table to get the gtf into R, and then, you can get rid of the undesired colum

ADD COMMENTlink written 5.4 years ago by Antonio R. Franco4.5k

R is not an easy or fast parsing tool, and GTF attributes are contained within a column and require three levels of parsing per line. A command-line approach is probably better suited to this task.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Alex Reynolds30k

I know perhaps I should make a separate question for this, but please bare with me. What command-line approach can be implemented to for example select only gene_id column in gtf file? I am aware of awk command, but as a newbie I have no clue how I would put this command together. Preferably a one-liner. Thanks!

ADD REPLYlink modified 9 months ago by RamRS30k • written 4.7 years ago by herman.pappoe.4510
$ gtf2bed < foo.gtf | cut -f4 > foo.gene_id_column.txt
ADD REPLYlink written 4.7 years ago by Alex Reynolds30k

Use above mentioned code from github. You will get GTF file in table format starting with geneid, chr_location (gene wise)...And also another file with transcriptid, chr_location (transcript wise)...

Or use this

A: Converting gtf format to bed format

This will give genewise table and a BED file. Both opproaches will work for you.

ADD REPLYlink modified 9 months ago by RamRS30k • written 4.7 years ago by EagleEye6.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2122 users visited in the last hour