extract only geneID and gene symbol from GTF file
8
1
Entering edit mode
6.2 years ago
catherine ▴ 180

Hi Everyone,

I want to get geneID(gene_id) and its associated gene symbol(gene_name) from a gtf file into a file that can be read into R.

my gtf file looks like this:

    

chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1";

chr1    HAVANA  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1-001"; level 2; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

 

Can anyone tell me how to write a little script (preferably  python) to do it?

Any idea would be appreciated!

 

 

genome gtf python • 18k views
ADD COMMENT
3
Entering edit mode
6.2 years ago

One quick and dirty way to extract attributes:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    attr = dict(item.strip().split(' ') for item in line.split('\t')[8].strip('\n').split('') if item)
    sys.stdout.write("%s\n" % (attr['gene_id'].strip('\"') + '\t' + attr['gene_name'].strip('\"')))

To use:

$ ./extract_gtf_attributes_of_interest.py < foo.gtf > answer.txt

The result with your test input:

$ more answer.txt
​ENSMUSG00000102693.1    RP23-271O17.1
ENSMUSG00000102693.1    RP23-271O17.1

As mentioned, this is quick and dirty and does no error checking or input validation. Consider adding a try..except block to handle the case where an attribute field is missing a value (or the attribute field is missing altogether).

ADD COMMENT
9
Entering edit mode
4.0 years ago
jaro.slamecka ▴ 150

For those who want to use R directly, there is indeed a fast and efficient way to do it, using package "refGenome", first create and empty refGenome object, then read the GTF file in and finally subset the data.frame to only keep gene IDs and symbols

library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Mus_musculus.GRCm38.89.gtf")
genes = gtf@ev$genes[ ,c("gene_id","gene_name")]

Edit: At this point, it is true that refGenome is no longer available, here is a fast alternative:

gtf.file = "/path/to/Homo_sapiens.GRCh38.100.gtf"
gtf.gr = rtracklayer::import(gtf.file) # creates a GRanges object
gtf.df = as.data.framegtf.gr)
genes = unique(gtf.df[ ,c("gene_id","gene_name")])
library(data.table)
fwrite(genes, file="gene_ID.gene_name.txt", sep="\t")

(line 3 above looks like it's missing the opening bracket on gtf.gr, not sure how to fix it)

ADD COMMENT
0
Entering edit mode

This package was removed from CRAN on the 4th March 2019 (https://cran.r-project.org/web/packages/refGenome/index.html)

ADD REPLY
0
Entering edit mode

The link shows that this package is still available. I can install this package today (07/24/2019)

ADD REPLY
0
Entering edit mode

How can I extract the "genes" with the subset of data in csv/tab format? thanks

ADD REPLY
0
Entering edit mode

In R

write.csv(genes, file="genes.csv", sep=",", quote=FALSE)

Use sep="\t" for tab delimited

ADD REPLY
3
Entering edit mode
6.2 years ago
EagleEye 7.0k

This should work,

Ensembl GTF:

zcat Homo_sapiens.GRCh38.94.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[5]"\t"$7}' | sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_biotype "//'| sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed 's/ //g' | sed '1igene_id\tGeneSymbol\tChromosome\tClass\tStrand' > Homo_sapiens.GRCh38.94_gene_annotation_table.txt

Output:

gene_id GeneSymbol  Chromosome  Class   Strand
ENSG00000223972 DDX11L1 1:11869-14409   transcribed_unprocessed_pseudogene  +
ENSG00000227232 WASH7P  1:14404-29570   unprocessed_pseudogene  -
ENSG00000278267 MIR6859-1   1:17369-17436   miRNA   -
ENSG00000243485 MIR1302-2HG 1:29554-31109   lincRNA +
ENSG00000284332 MIR1302-2   1:30366-30503   miRNA   +
ENSG00000237613 FAM138A 1:34554-36081   lincRNA -
ENSG00000268020 OR4G4P  1:52473-53312   unprocessed_pseudogene  +
ENSG00000240361 OR4G11P 1:57598-64116   transcribed_unprocessed_pseudogene  +
ENSG00000186092 OR4F5   1:65419-71585   protein_coding  +

Gencode GTF:

zcat gencode.v28.annotation.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' | sed "1i\Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength"  > gencode.v28_gene_annotation_table.txt

Output:

Geneid  GeneSymbol  Chromosome  Start   End Class   Strand  Length
ENSG00000223972.5    DDX11L1    chr1    11869   14409    transcribed_unprocessed_pseudogene +   2540
ENSG00000227232.5    WASH7P chr1    14404   29570    unprocessed_pseudogene -   15166
ENSG00000278267.1    MIR6859-1  chr1    17369   17436    miRNA  -   67
ENSG00000243485.5    MIR1302-2HG    chr1    29554   31109    lincRNA    +   1555
ENSG00000284332.1    MIR1302-2  chr1    30366   30503    miRNA  +   137
ENSG00000237613.2    FAM138A    chr1    34554   36081    lincRNA    -   1527
ENSG00000268020.3    OR4G4P chr1    52473   53312    unprocessed_pseudogene +   839
ENSG00000240361.2    OR4G11P    chr1    57598   64116    transcribed_unprocessed_pseudogene +   6518
ENSG00000186092.6    OR4F5  chr1    65419   71585    protein_coding +   6166
ADD COMMENT
0
Entering edit mode

What should have worked??

ADD REPLY
1
Entering edit mode
6.2 years ago
Tariq Daouda ▴ 210

Hi,

I have written a GTF parser, it is part of pyGeno but you can also download it separately from github. It should make the extraction easier.

from pyGeno.tools.parsers.GTFTools import GTFFile

gtf = GTFFile(gtfFilePath)
for line in gtf :
   print line["gene_id"], line["gene_name"]

Cheers,

 

ADD COMMENT
1
Entering edit mode
3.7 years ago
prasundutta87 ▴ 570

best tool for this..http://gffutils.readthedocs.io/en/latest/GTF_extract.html

ADD COMMENT
1
Entering edit mode
2.9 years ago
Min Dai ▴ 160
awk 'BEGIN{FS="\t";OFS="\t"}{
  if($3=="exon"){
    print $5-$4+1,$9;
  }}' genes.gtf | tr -d ";\"" | awk 'BEGIN{FS=" ";OFS="\t"}{
    if($6=="p_id"){
      L[$9]+=$1;
    }else{
      L[$7]+=$1;
    }
    }END{
      for(i in L){
        print i"\t"L[i];
      }
      }' >genes_transcript_length.txt

I calculated length for both coding and noncoding transcripts.

Refer to: https://gist.github.com/sp00nman/10372555

ADD COMMENT
0
Entering edit mode
6.2 years ago

And why not using R directly.. ?

The gtf file use to be either tabulated or space separated

You can run a read.table to get the gtf into R, and then, you can get rid of the undesired colum

ADD COMMENT
2
Entering edit mode

R is not an easy or fast parsing tool, and GTF attributes are contained within a column and require three levels of parsing per line. A command-line approach is probably better suited to this task.

ADD REPLY
0
Entering edit mode

I know perhaps I should make a separate question for this, but please bare with me. What command-line approach can be implemented to for example select only gene_id column in gtf file? I am aware of awk command, but as a newbie I have no clue how I would put this command together. Preferably a one-liner. Thanks!

ADD REPLY
1
Entering edit mode
$ gtf2bed < foo.gtf | cut -f4 > foo.gene_id_column.txt
ADD REPLY
1
Entering edit mode

Use above mentioned code from github. You will get GTF file in table format starting with geneid, chr_location (gene wise)...And also another file with transcriptid, chr_location (transcript wise)...

Or use this

A: Converting gtf format to bed format

This will give genewise table and a BED file. Both opproaches will work for you.

ADD REPLY

Login before adding your answer.

Traffic: 1471 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6