Question

How do I add new columns to my dictionary

4.1 years ago

rcnml16 • 0

Hi everyone,

I am working on transforming a GTF file into a searchable database. The first step is to parse the GTF file into a dictionary.

This is a row of the GTF I am parsing:

chr22   refGene exon    24666799        24666951        .       +       .       gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; exon_id "NM_015330.1"; gene_name "SPECC1L";

The code I have so far splits column 8 (the attribute field) into its separate key/value parts.

import sys


def parse_gtf(f):
    """Yield one dictionary of attributes (column 8) per line of a GTF file."""
    with open(f, 'r') as f_in:
        for line in f_in:
            # Column 8 (0-based) holds the attributes, e.g. gene_id "SPECC1L"; ...
            info_field_line = line.split("\t")[8]
            # Attributes are delimited by ";"
            info_field_line_array = info_field_line.rstrip().split(";")

            # Build one dictionary per line: attribute name -> attribute value
            dict1 = {}
            for i in info_field_line_array:
                # Name and value are separated by whitespace
                sp = i.lstrip().split()
                if len(sp) > 1:
                    key = sp[0]
                    # Drop the double quotes around the value
                    value = sp[1].strip('"')
                    dict1[key] = value
            yield dict1


if __name__ == '__main__':
    gtf_file = sys.argv[1]
    gtf_data = parse_gtf(gtf_file)

    for x in gtf_data:
        print(x)

The output looks like this:

{'gene_id': 'SPECC1L', 'transcript_id': 'NM_015330', 'exon_number': '1', 'exon_id': 'NM_015330.1', 'gene_name': 'SPECC1L'}

I am trying to add the other columns by doing this:

#Add the other columns to the dictionary
dict1['chromosome'] = fields[0]
dict1['source'] = fields[1]
dict1['feature'] = fields[2]
dict1['start'] = fields[3]
dict1['end'] = fields[4]

return(dict1)

But this raises an error saying that "fields is not defined". Does anybody know how to do this?

Or can somebody help me to add the columns 0 to 4 to the dictionary? :o Thanks!

GGF GTF parse • 766 views

updated 4.1 years ago by Ram 43k • written 4.1 years ago by rcnml16 • 0

I think I just fixed it with:

fields = line.split("\t")
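For anyone landing here later, this is roughly where that line fits. It is only a minimal sketch, not a full GTF parser: the function name parse_gtf_lines is made up for illustration, it takes any iterable of lines rather than a file name so it can be tried without a file, and converting start and end to int is my own assumption (the snippet in the question keeps them as strings). It also assumes attribute values do not contain ";" themselves.

```python
def parse_gtf_lines(lines):
    """Yield one dict per GTF row: columns 0-4 plus the column 8 attributes.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    for example an open file handle.
    """
    for line in lines:
        # Skip blank lines and header/comment lines, which have no columns.
        if not line.strip() or line.startswith("#"):
            continue

        # Split the row once and reuse the result for every column.
        fields = line.rstrip("\n").split("\t")

        row = {
            "chromosome": fields[0],
            "source": fields[1],
            "feature": fields[2],
            "start": int(fields[3]),  # assumption: store coordinates as integers
            "end": int(fields[4]),
        }

        # Column 8 holds the attributes: name "value"; name "value"; ...
        for attribute in fields[8].split(";"):
            parts = attribute.strip().split(None, 1)
            if len(parts) == 2:
                row[parts[0]] = parts[1].strip('"')

        yield row


# The example row from the question, with tabs between the nine columns.
example = (
    "chr22\trefGene\texon\t24666799\t24666951\t.\t+\t.\t"
    'gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; '
    'exon_id "NM_015330.1"; gene_name "SPECC1L";\n'
)

rows = list(parse_gtf_lines([example]))
print(rows[0])
```

To run it on a real file, pass an open file handle instead of the list, e.g. with open(gtf_file) as f_in: for row in parse_gtf_lines(f_in): print(row). The key point is that fields has to be assigned inside the loop, before the dictionary entries that use it.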

4.1 years ago by rcnml16 • 0