Question: Extract features from GFF file
3.8 years ago by
Ander50
Spain
Ander50 wrote:

Hi pals,

I have a genome in GTF format like this:

CP014038.1  GeneMarkS+  CDS 3717912 3718988 .   -   0   "ID=cds0;Parent=gene0;Dbxref=NCB...."

CP014038.1  Genbank gene    631 2190    .   -   .   "ID=gene1;Name=AL538_00010;gbkey=....."

Is there a way to extract the features I want (locus_tag, Name, ...) from the last column so that each line looks like this?

CP014038.1  GeneMarkS+  CDS 3717912 3718988 .   -   0   "locus_tag=AL34598_3409; Name=N/A;....."

Thanks for your help,
Ander

3.7 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

You could use the following GTF-processing skeleton, which extracts the attributes column to a Python dictionary.

#!/usr/bin/env python

import sys
import os

for line in sys.stdin:
    chomped_line = line.rstrip('\r\n')
    if not chomped_line.strip() or chomped_line.startswith('#'):
        # skip blank lines, '##' directives and '#' comment lines
        pass
    elif chomped_line.startswith('track'):
        # skip non-standard use of track keyword by Ensembl 
        pass
    else:
        elems = chomped_line.split('\t')
        cols = dict()
        try:
            cols['seqname'] = elems[0].lstrip(' ') # strip leading whitespace
            cols['source'] = elems[1]
            cols['feature'] = elems[2]
            cols['start'] = int(elems[3])
            cols['end'] = int(elems[4])
            cols['score'] = elems[5]
            cols['strand'] = elems[6]
            cols['frame'] = elems[7]
            cols['attributes'] = elems[8].rstrip(' ') # strip trailing whitespace
        except (IndexError, ValueError):
            # IndexError: fewer than nine columns; ValueError: non-integer start or end
            sys.stderr.write("[%s] - Error: Input appears to have missing or malformed GTF fields (check that your input data is GTF-formatted)\n" % (sys.argv[0]))
            sys.exit(os.EX_DATAERR)

        try:
            cols['comments'] = elems[9]
        except IndexError as ie:
            cols['comments'] = None

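        # Split each pair once, on '=' (GFF3 "key=value") or on the first
        # space (GTF 'key "value"'), so values containing spaces survive.
        attributes = dict()
        for item in cols['attributes'].strip('"').split(';'):
            item = item.strip()
            if not item:
                continue
            sep = '=' if '=' in item.split(' ', 1)[0] else ' '
            key, _, value = item.partition(sep)
            attributes[key] = value.strip().strip('"')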

        # do stuff with attributes

You could filter out key-value pairs, process certain keys, or rewrite key-value pairs in some desired order, etc.
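As a rough sketch of the "rewrite key-value pairs in some desired order" idea, applied to GFF3-style `key=value` attributes like those in the question: the function names (`parse_gff3_attributes`, `rewrite_attributes`, `rewrite_line`), the `wanted` key list, and the `N/A` placeholder for missing keys are illustrative assumptions, and the sample record is invented.

```python
def parse_gff3_attributes(column):
    """Parse a GFF3 attributes column ("key=value;key=value") into a dict."""
    attributes = {}
    # Drop surrounding whitespace and any quotes wrapped around the whole column.
    for item in column.strip().strip('"').split(';'):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition('=')
        attributes[key.strip()] = value.strip()
    return attributes


def rewrite_attributes(column, wanted, missing='N/A'):
    """Return a new attributes column holding only the wanted keys, in order."""
    attributes = parse_gff3_attributes(column)
    return ';'.join('%s=%s' % (key, attributes.get(key, missing)) for key in wanted)


def rewrite_line(line, wanted):
    """Rewrite column 9 of one tab-delimited record; pass other lines through."""
    stripped = line.rstrip('\r\n')
    fields = stripped.split('\t')
    if stripped.startswith('#') or len(fields) < 9:
        return stripped
    fields[8] = rewrite_attributes(fields[8], wanted)
    return '\t'.join(fields)


if __name__ == '__main__':
    # Hypothetical record; the attribute values are made up for illustration.
    record = 'chr1\tGenbank\tgene\t631\t2190\t.\t-\t.\tID=gene1;Name=abc;locus_tag=XYZ_0001'
    print(rewrite_line(record, ['locus_tag', 'Name', 'product']))
```

Keys absent from a record (here `product`) come out as `N/A`, so every output line carries the same attributes in the same order.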


I forgot to close the thread when I managed to get what I was looking for. Thanks anyway, I'll try your approach next time I need to do this!

Reply written 3.7 years ago by Ander50