How do I add new columns to my dictionary?
4.1 years ago
rcnml16 • 0

Hi everyone,

I am working on transforming a GTF file into a searchable database. The first step is to parse the GTF file into a dictionary.

This is a row of the GTF I am parsing:

chr22   refGene exon    24666799        24666951        .       +       .       gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; exon_id "NM_015330.1"; gene_name "SPECC1L";

The code I have so far splits the attribute column (index 8) into separate key/value pairs.

import sys


def parse_gtf(f):
    """Yield one dictionary of attribute key/value pairs per GTF line."""
    with open(f, 'r') as f_in:
        for line in f_in:
            # Skip blank lines and comment/header lines, which have no attribute column.
            if not line.strip() or line.startswith("#"):
                continue
            # The attribute column (index 8) holds entries separated by ";".
            info_field_line = line.split("\t")[8]
            info_field_line_array = info_field_line.rstrip().split(";")

            # Build one dictionary per line: attribute name -> attribute value.
            dict1 = {}
            for i in info_field_line_array:
                # Each attribute looks like: key "value"
                sp = i.lstrip().split()
                if len(sp) > 1:
                    # The first token is the key (gene_id, transcript_id, ...).
                    key = sp[0]
                    # The second token is the value, with its quotes stripped.
                    value = sp[1].strip('"')
                    dict1[key] = value
            yield dict1


if __name__ == '__main__':
    gtf_file = sys.argv[1]
    gtf_data = parse_gtf(gtf_file)

    for x in gtf_data:
        print(x)

The outcome will look like this:

{'gene_id': 'SPECC1L', 'transcript_id': 'NM_015330', 'exon_number': '1', 'exon_id': 'NM_015330.1', 'gene_name': 'SPECC1L'}

I am trying to add columns by doing this:

#Add the other columns to the dictionary
dict1['chromosome'] = fields[0]
dict1['source'] = fields[1]
dict1['feature'] = fields[2]
dict1['start'] = fields[3]
dict1['end'] = fields[4]

return(dict1)

But it raises an error saying that "fields is not defined". Does anybody know how to fix this?

Or can somebody help me add columns 0 to 4 to the dictionary? Thanks!


I think I just fixed it with:

fields = line.split("\t")
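
For anyone landing here later, this is a minimal sketch of how that fix might fit into the parser: split the line once into `fields`, take columns 0 to 4 from it, and parse the attributes from `fields[8]`. The names `parse_gtf_line` and `record` are made up for illustration and are not from the original post, and skipping comment lines and short rows is an assumption about the input.

```python
def parse_gtf_line(line):
    """Return a dict for one GTF row, or None for comment/blank/short lines."""
    # Assumption: comment lines start with "#" and can be ignored.
    if not line.strip() or line.startswith("#"):
        return None

    # Split once, so the fixed columns and the attributes share one list.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None

    # Columns 0 to 4, kept as strings as in the original snippet.
    record = {
        "chromosome": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": fields[3],
        "end": fields[4],
    }

    # The attribute column: entries like key "value", separated by ";".
    for attribute in fields[8].split(";"):
        parts = attribute.strip().split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip('"')
    return record


def parse_gtf(path):
    """Yield one dict per data row of the GTF file at `path`."""
    with open(path, "r") as f_in:
        for line in f_in:
            record = parse_gtf_line(line)
            if record is not None:
                yield record
```

Calling `parse_gtf(sys.argv[1])` in the `__main__` block works the same way as before. If numeric searches are needed later, `start` and `end` could be converted with `int()` when the record is built.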
