Question: extract only geneID and gene symbol from GTF file
0
gravatar for catherine12243
3.8 years ago by
United States
catherine12243100 wrote:

Hi Everyone,

I want to get geneID(gene_id) and its associated gene symbol(gene_name) from a gtf file into a file that can be read into R.

my gtf file looks like this:

    

chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1";

chr1    HAVANA  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1-001"; level 2; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

 

Can anyone tell me how to write a little script (preferably  python) to do it?

Any idea would be appreciated!

 

 

python gtf genome • 8.9k views
ADD COMMENTlink modified 5 months ago by Minstein60 • written 3.8 years ago by catherine12243100
2
gravatar for Alex Reynolds
3.8 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

One quick and dirty way to extract attributes:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    attr = dict(item.strip().split(' ') for item in line.split('\t')[8].strip('\n').split('') if item)
    sys.stdout.write("%s\n" % (attr['gene_id'].strip('\"') + '\t' + attr['gene_name'].strip('\"')))

To use:

$ ./extract_gtf_attributes_of_interest.py < foo.gtf > answer.txt

The result with your test input:

$ more answer.txt
​ENSMUSG00000102693.1    RP23-271O17.1
ENSMUSG00000102693.1    RP23-271O17.1

As mentioned, this is quick and dirty and does no error checking or input validation. Consider adding a try..except block to handle the case where an attribute field is missing a value (or the attribute field is missing altogether).

ADD COMMENTlink modified 22 months ago • written 3.8 years ago by Alex Reynolds27k
4
gravatar for jaro.slamecka
18 months ago by
Mitchell Cancer Institute, University of South Alabama
jaro.slamecka60 wrote:

For those who want to use R directly, there is indeed a fast and efficient way to do it, using package "refGenome", first create and empty refGenome object, then read the GTF file in and finally subset the data.frame to only keep gene IDs and symbols

library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Mus_musculus.GRCm38.89.gtf")
genes = gtf@ev$genes[ ,c("gene_id","gene_name")]
ADD COMMENTlink written 18 months ago by jaro.slamecka60
1
gravatar for prasundutta87
15 months ago by
prasundutta87300
prasundutta87300 wrote:

best tool for this..http://gffutils.readthedocs.io/en/latest/GTF_extract.html

ADD COMMENTlink written 15 months ago by prasundutta87300
0
gravatar for EagleEye
3.8 years ago by
EagleEye6.2k
Sweden
EagleEye6.2k wrote:

This should work,

Ensembl GTF:

zcat Homo_sapiens.GRCh38.94.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[5]"\t"$7}' | sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_biotype "//'| sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed 's/ //g' | sed '1igene_id\tGeneSymbol\tChromosome\tClass\tStrand' > Homo_sapiens.GRCh38.94_gene_annotation_table.txt

Output:

gene_id GeneSymbol  Chromosome  Class   Strand
ENSG00000223972 DDX11L1 1:11869-14409   transcribed_unprocessed_pseudogene  +
ENSG00000227232 WASH7P  1:14404-29570   unprocessed_pseudogene  -
ENSG00000278267 MIR6859-1   1:17369-17436   miRNA   -
ENSG00000243485 MIR1302-2HG 1:29554-31109   lincRNA +
ENSG00000284332 MIR1302-2   1:30366-30503   miRNA   +
ENSG00000237613 FAM138A 1:34554-36081   lincRNA -
ENSG00000268020 OR4G4P  1:52473-53312   unprocessed_pseudogene  +
ENSG00000240361 OR4G11P 1:57598-64116   transcribed_unprocessed_pseudogene  +
ENSG00000186092 OR4F5   1:65419-71585   protein_coding  +

Gencode GTF:

zcat gencode.v28.annotation.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' | sed "1i\Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength"  > gencode.v28_gene_annotation_table.txt

Output:

Geneid  GeneSymbol  Chromosome  Start   End Class   Strand  Length
ENSG00000223972.5    DDX11L1    chr1    11869   14409    transcribed_unprocessed_pseudogene +   2540
ENSG00000227232.5    WASH7P chr1    14404   29570    unprocessed_pseudogene -   15166
ENSG00000278267.1    MIR6859-1  chr1    17369   17436    miRNA  -   67
ENSG00000243485.5    MIR1302-2HG    chr1    29554   31109    lincRNA    +   1555
ENSG00000284332.1    MIR1302-2  chr1    30366   30503    miRNA  +   137
ENSG00000237613.2    FAM138A    chr1    34554   36081    lincRNA    -   1527
ENSG00000268020.3    OR4G4P chr1    52473   53312    unprocessed_pseudogene +   839
ENSG00000240361.2    OR4G11P    chr1    57598   64116    transcribed_unprocessed_pseudogene +   6518
ENSG00000186092.6    OR4F5  chr1    65419   71585    protein_coding +   6166
ADD COMMENTlink modified 3 months ago • written 3.8 years ago by EagleEye6.2k

What should have worked??

ADD REPLYlink written 3 months ago by SmallChess470
0
gravatar for Tariq Daouda
3.8 years ago by
Tariq Daouda200
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda200 wrote:

Hi,

I have written a GTF parser, it is part of pyGeno but you can also download it separately from github. It should make the extraction easier.

from pyGeno.tools.parsers.GTFTools import GTFFile

gtf = GTFFile(gtfFilePath)
for line in gtf :
   print line["gene_id"], line["gene_name"]

Cheers,

 

ADD COMMENTlink written 3.8 years ago by Tariq Daouda200
0
gravatar for Antonio R. Franco
3.8 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

And why not using R directly.. ?

The gtf file use to be either tabulated or space separated

You can run a read.table to get the gtf into R, and then, you can get rid of the undesired colum

ADD COMMENTlink written 3.8 years ago by Antonio R. Franco4.0k
2

R is not an easy or fast parsing tool, and GTF attributes are contained within a column and require three levels of parsing per line. A command-line approach is probably better suited to this task.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Alex Reynolds27k

I know perhaps I should make a separate question for this, but please bare with me. What command-line approach can be implemented to for example select only gene_id column in gtf file? I am aware of awk command, but as a newbie I have no clue how I would put this command together. Preferably a one-liner. Thanks!

 

 

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by herman.pappoe.4510
1
$ gtf2bed < foo.gtf | cut -f4 > foo.gene_id_column.txt
ADD REPLYlink written 3.1 years ago by Alex Reynolds27k
1

Use above mentioned code from github. You will get GTF file in table format starting with geneid, chr_location (gene wise)...And also another file with transcriptid, chr_location (transcript wise)...

 

Or use this 

A: Converting gtf format to bed format

This will give genewise table and a BED file. Both opproaches will work for you.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by EagleEye6.2k
0
gravatar for ATpoint
15 months ago by
ATpoint13k
Germany
ATpoint13k wrote:

Have a look at this awk-ish tutorial.

ADD COMMENTlink written 15 months ago by ATpoint13k
0
gravatar for Minstein
5 months ago by
Minstein60
Minstein60 wrote:
awk 'BEGIN{FS="\t";OFS="\t"}{
  if($3=="exon"){
    print $5-$4+1,$9;
  }}' genes.gtf | tr -d ";\"" | awk 'BEGIN{FS=" ";OFS="\t"}{
    if($6=="p_id"){
      L[$9]+=$1;
    }else{
      L[$7]+=$1;
    }
    }END{
      for(i in L){
        print i"\t"L[i];
      }
      }' >genes_transcript_length.txt

I calculated length for both coding and noncoding transcripts.

Refer to: https://gist.github.com/sp00nman/10372555

ADD COMMENTlink modified 5 months ago • written 5 months ago by Minstein60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 845 users visited in the last hour