Question: How do I add new columns to my dictionary?
11 months ago, rcnml16 wrote:

Hi everyone,

I am working on turning a GTF file into a searchable database. The first step is to parse the GTF file into a dictionary.

This is a row of the GTF file I am parsing:

chr22   refGene exon    24666799        24666951        .       +       .       gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; exon_id "NM_015330.1"; gene_name "SPECC1L";
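
For orientation, a GTF row has nine tab-separated columns, and the attributes shown above sit in the last one (index 8). Below is a minimal sketch that labels them, assuming the usual GTF column order. The names GTF_COLUMNS and split_gtf_row are illustrative and not part of any library.

```python
# Column names follow the usual GTF layout; the identifiers are illustrative.
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def split_gtf_row(line):
    """Split one tab-separated GTF row into a {column_name: text} dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(GTF_COLUMNS, values))


# The row from the question, rebuilt with explicit tabs.
row = "\t".join([
    "chr22", "refGene", "exon", "24666799", "24666951", ".", "+", ".",
    'gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; '
    'exon_id "NM_015330.1"; gene_name "SPECC1L";',
])

columns = split_gtf_row(row)
print(columns["seqname"], columns["feature"], columns["start"], columns["end"])
```

All values stay as strings here. Converting start and end to integers would be a separate step.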

The code I have so far splits column 8 (the attributes field) into its separate key/value parts.

import sys


def parse_gtf(f):
    with open(f, 'r') as f_in:
        for line in f_in:
            # Column 8 (counting from 0) is the attributes field
            info_field_line = line.split("\t")[8]
            # The delimiter between attributes is ";"
            info_field_line_array = info_field_line.rstrip().split(";")

            # For each line of the GTF, build a dictionary: attribute name -> value
            dict1 = {}
            for i in info_field_line_array:
                # Each attribute looks like: key "value"
                sp = i.lstrip().split()
                if len(sp) > 1:
                    # The first token is the key (gene_id, transcript_id, ...)
                    key = sp[0]
                    # The second token is the value, with the quotes stripped
                    value = sp[1].strip('"')
                    # Put them in the dictionary
                    dict1[key] = value
            # Hand the finished dictionary back to the caller
            yield dict1

if __name__ == '__main__':
    gtf_file = sys.argv[1]
    gtf_data = parse_gtf(gtf_file)

    for x in gtf_data:
        print(x)

The outcome will look like this:

{'gene_id': 'SPECC1L', 'transcript_id': 'NM_015330', 'exon_number': '1', 'exon_id': 'NM_015330.1', 'gene_name': 'SPECC1L'}

I am trying to add the other columns by doing this:

#Add the other columns to the dictionary
dict1['chromosome'] = fields[0]
dict1['source'] = fields[1]
dict1['feature'] = fields[2]
dict1['start'] = fields[3]
dict1['end'] = fields[4]


But this raises an error saying that "fields" is not defined. Does anybody know how to fix this?

Or can somebody help me to add the columns 0 to 4 to the dictionary? :o Thanks!


I think I just fixed it with:

fields = line.split("\t")
written 11 months ago by rcnml16
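
Splitting the line once and keeping the result in a variable named fields is what makes the other columns available, since the original loop only ever kept element 8 of the split. Below is a minimal sketch of the whole parser with that fix, assuming tab-separated input and the attribute layout shown in the question. The helper name parse_gtf_line and the skipping of blank and "#" lines are choices of this sketch, not part of the original code.

```python
def parse_gtf_line(line):
    """Turn one GTF row into a dict of its attributes plus columns 0 to 4."""
    # Split once and reuse the list, so `fields` is defined for every column.
    fields = line.rstrip("\n").split("\t")

    record = {}
    # Column 8 holds the attributes in the form: key "value"; key "value"; ...
    for item in fields[8].split(";"):
        parts = item.strip().split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip('"')

    # Add the other columns to the dictionary (kept as strings, as in the question).
    record["chromosome"] = fields[0]
    record["source"] = fields[1]
    record["feature"] = fields[2]
    record["start"] = fields[3]
    record["end"] = fields[4]
    return record


def parse_gtf(path):
    """Yield one dictionary per data row of the GTF file at `path`."""
    with open(path) as f_in:
        for line in f_in:
            # Skip blank lines and "#" header lines (an assumption of this sketch).
            if line.strip() and not line.startswith("#"):
                yield parse_gtf_line(line)


# The row from the question, rebuilt with explicit tabs.
sample = "\t".join([
    "chr22", "refGene", "exon", "24666799", "24666951", ".", "+", ".",
    'gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; '
    'exon_id "NM_015330.1"; gene_name "SPECC1L";',
])
record = parse_gtf_line(sample)
print(record)
```

To run it on a whole file, loop over parse_gtf("your_file.gtf") and print or collect each dictionary. The file name here is a placeholder.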