Question: Extract features from GFF file
3.8 years ago
Ander wrote:

Hi pals,

I have a genome in GTF format like this:

CP014038.1  GeneMarkS+  CDS 3717912 3718988 .   -   0   "ID=cds0;Parent=gene0;Dbxref=NCB...."

CP014038.1  Genbank gene    631 2190    .   -   .   "ID=gene1;Name=AL538_00010;gbkey=....."

Is there a way to extract the features I want (locus_tag, Name...) from the last column and make it look like this?

CP014038.1  GeneMarkS+  CDS 3717912 3718988 .   -   0   "locus_tag=AL34598_3409; Name=N/A;....."

Thanks for your help.

Ander

modified 3.7 years ago
Alex Reynolds wrote:

You could use the following GTF-processing skeleton, which extracts the attributes column to a Python dictionary.

#!/usr/bin/env python

import sys
import os

for line in sys.stdin:
    convertedLine = ""
    chomped_line = line.rstrip(os.linesep)
    if chomped_line.startswith('##'):
        # skip header and pragma lines
        continue
    elif chomped_line.startswith('track'):
        # skip non-standard use of track keyword by Ensembl
        continue
    else:
        elems = chomped_line.split('\t')
        cols = dict()
        try:
            cols['seqname'] = elems[0].lstrip(' ') # strip leading whitespace
            cols['source'] = elems[1]
            cols['feature'] = elems[2]
            cols['start'] = int(elems[3])
            cols['end'] = int(elems[4])
            cols['score'] = elems[5]
            cols['strand'] = elems[6]
            cols['frame'] = elems[7]
            cols['attributes'] = elems[8].rstrip(' ') # strip trailing whitespace
        except IndexError as ie:
            sys.stderr.write("[%s] - Error: Input appears to be missing GTF-specific fields (check that your input data is GTF-formatted)\n" % (sys.argv[0]))
            sys.exit(1) # stop here: the required columns are not available

        # an optional tenth column holds comments
        try:
            cols['comments'] = elems[9]
        except IndexError as ie:
            cols['comments'] = None

        # GTF attributes are 'key value' pairs separated by ';'
        # split on the first space only, so values containing spaces survive
        attributes = dict(item.strip().split(' ', 1) for item in cols['attributes'].split(';') if item.strip())

        # do stuff with attributes

You could filter out key-value pairs, process certain keys, or rewrite key-value pairs in some desired order, etc.
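As a rough sketch of that last idea: the sample lines in the question use GFF3-style `key=value` pairs (for example `ID=gene1;Name=AL538_00010`) rather than GTF-style `key "value"` pairs, so the snippet below splits each pair on `=` instead of on a space. The function names `parse_attributes` and `rewrite_line`, the `N/A` placeholder for missing keys, and the choice to drop the surrounding double quotes are all assumptions made for illustration, not part of any library.

```python
def parse_attributes(column):
    """Parse a GFF3-style attributes column (key=value pairs separated by ';')."""
    # Assumption: the column may be wrapped in double quotes, as in the question.
    text = column.strip().strip('"')
    attributes = {}
    for item in text.split(';'):
        item = item.strip()
        if not item:
            continue  # ignore empty pieces left by a trailing ';'
        key, _, value = item.partition('=')
        attributes[key] = value
    return attributes


def rewrite_line(line, wanted, missing='N/A'):
    """Rewrite column 9 so it holds only the wanted keys, in the given order."""
    if line.startswith('#'):
        return line  # pass comment and pragma lines through unchanged
    elems = line.split('\t')
    if len(elems) < 9:
        return line  # not a nine-column feature record; leave it untouched
    attributes = parse_attributes(elems[8])
    elems[8] = '; '.join(
        '%s=%s' % (key, attributes.get(key, missing)) for key in wanted
    )
    return '\t'.join(elems)


# Record built from the values shown in the question (elided part left out).
record = 'CP014038.1\tGenbank\tgene\t631\t2190\t.\t-\t.\t"ID=gene1;Name=AL538_00010"'
print(rewrite_line(record, ['locus_tag', 'Name']))
```

For the record above, the last column becomes `locus_tag=N/A; Name=AL538_00010`, since that line carries no `locus_tag`. To process a whole file, you could call `rewrite_line(line.rstrip('\r\n'), wanted)` for each line read from `sys.stdin`.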

written 3.7 years ago by Alex Reynolds

I forgot to close the thread when I managed to get what I was looking for. Thanks anyway, I'll try your approach next time I need to do this!

written 3.7 years ago by Ander
