4.1 years ago
rcnml16
Hi everyone,
I am working on turning a GTF file into a searchable database. The first step is to parse the GTF file into a dictionary.
This is a row of the gtf I am parsing:
chr22 refGene exon 24666799 24666951 . + . gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; exon_id "NM_015330.1"; gene_name "SPECC1L";
The code I have so far splits column 8 (zero-based, the attribute column) into its separate key/value parts.
import sys


def parse_gtf(f):
    """Yield the attribute column (column 8) of each GTF line as a dictionary."""
    with open(f, 'r') as f_in:
        for line in f_in:
            # Skip header/comment lines, which have no ninth column
            if line.startswith("#"):
                continue
            info_field_line = line.split("\t")[8]
            # Attributes are delimited by ";"
            info_field_line_array = info_field_line.rstrip().split(";")
            # For each line of the GTF, build a dictionary of attribute name -> value
            dict1 = {}
            for i in info_field_line_array:
                # Each attribute looks like: key "value"
                sp = i.lstrip().split()
                if len(sp) > 1:
                    # The first token is the key (gene_id, transcript_id, ...)
                    key = sp[0]
                    # The second token is the value, with its quotes stripped
                    value = sp[1].strip('"')
                    # Put them in the dictionary
                    dict1[key] = value
            yield dict1


if __name__ == '__main__':
    gtf_file = sys.argv[1]
    gtf_data = parse_gtf(gtf_file)
    for x in gtf_data:
        print(x)
The outcome will look like this:
{'gene_id': 'SPECC1L', 'transcript_id': 'NM_015330', 'exon_number': '1', 'exon_id': 'NM_015330.1', 'gene_name': 'SPECC1L'}
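As a cross-check on that output, the attribute column can also be parsed with a regular expression instead of splitting on whitespace. This is only a minimal sketch: the helper name `parse_attributes` is hypothetical, and it assumes every attribute has the form `key "value";` with a quoted value (unquoted values would be skipped).

```python
import re

# Matches one attribute of the form: key "value"
# Assumption: values are always wrapped in double quotes.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attribute_field):
    """Return the key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_field))


example = (
    'gene_id "SPECC1L"; transcript_id "NM_015330"; exon_number "1"; '
    'exon_id "NM_015330.1"; gene_name "SPECC1L";'
)
print(parse_attributes(example))
```

Unlike a plain `split()`, the pattern keeps a quoted value together even if it contains spaces.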
I am trying to add the other columns by doing this:
#Add the other columns to the dictionary
dict1['chromosome'] = fields[0]
dict1['source'] = fields[1]
dict1['feature'] = fields[2]
dict1['start'] = fields[3]
dict1['end'] = fields[4]
return(dict1)
But this raises an error saying that "fields is not defined". Does anybody know how to do this?
Or can somebody help me to add the columns 0 to 4 to the dictionary? :o Thanks!
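The error most likely comes from the fact that the parser only ever stores `line.split("\t")[8]`, so no variable called `fields` exists when the extra columns are added. One way this could be handled, as a minimal sketch, is to split each line once, keep the result in `fields`, and `yield` the dictionary from inside the loop rather than `return` it. The names `path` and `record` are illustrative, and converting `start` and `end` to `int` is an assumption about what the database needs.

```python
def parse_gtf(path):
    """Yield one dict per GTF feature line: attributes plus columns 0-4."""
    with open(path) as f_in:
        for line in f_in:
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # Split once and keep the result, so `fields` is defined below.
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue
            record = {}
            for attribute in fields[8].split(";"):
                # Each attribute looks like: key "value"
                parts = attribute.strip().split(None, 1)
                if len(parts) == 2:
                    record[parts[0]] = parts[1].strip('"')
            # Add the positional columns to the same dictionary.
            record["chromosome"] = fields[0]
            record["source"] = fields[1]
            record["feature"] = fields[2]
            # Assumption: numeric coordinates are more useful than strings.
            record["start"] = int(fields[3])
            record["end"] = int(fields[4])
            yield record
```

It can be used the same way as the original generator, for example `for row in parse_gtf(sys.argv[1]): print(row)`.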
I think I just fixed it with: