Question

Problems with calculating gcskew functions in python

0

Entering edit mode

4.7 years ago

H.Chloe ▴ 20

Hello, everyone.

I have some problems with calculating gcskews in python.
My 2 major inputs are fasta file and bed file.
Bed file has columns of gn(0), gene_type(1), gene name(2), chromosome(3), strand(4), num(5), start(6).(These numbers are index numbers in python.) Then I am trying to use some functions which can calculate gcskews of sense and antisense strand from the start site of each gene. The window is 100bp and these are the functions.

import re
import sys
import os

# opening bed file
content= []
with open("gene_info.full.tsv") as new : 
 for line in new :
     content.append(line.strip().split())
content = content[1:]       

def fasta2dict(fil):
     dic = {}
     scaf = ''
     seq = []
     for line in open(fil):
         if line.startswith(">") and scaf == '':
             scaf = line.split(' ')[0].lstrip(">").replace("\n", "")
         elif line.startswith(">") and scaf != '':
             dic[scaf] = ''.join(seq)
             scaf = line.split(' ')[0].lstrip(">").replace("\n", "")
             seq = []
         else:
             seq.append(line.rstrip())
     dic[scaf] = ''.join(seq)
     return dic

dic_file = fasta2dict("full.fa")

# functions for gc skew
def GC_skew_up(strand, loc, seq, window = 100) : # need -1 for index
  values_up = []
  loc = loc - 1
  if strand == "+" :
    sp_up = seq[loc - window : loc]
    g_up = sp_up.count('G') + sp_up.count('g') 
    c_up = sp_up.count('C') + sp_up.count('c') 
    try :
        skew_up = (g_up - c_up) / float(g_up + c_up)
    except ZeroDivisionError:
        skew_up = 0.0
    values_up.append(skew_up)
  elif strand == "-" :
    sp_up = seq[loc : loc + window]
    g_up = sp_up.count('G') + sp_up.count('g') 
    c_up = sp_up.count('C') + sp_up.count('c')
    try :
        skew_up = (c_up - g_up) / float(g_up + c_up)
    except ZeroDivisionError:
        skew_up = 0.0
    values_up.append(skew_up)

  return values_up  

def GC_skew_dw(strand, loc, seq, window = 100) :
  values_dw = []
  loc = loc - 1
  if strand == "+" :
    sp_dw = seq[loc : loc + window]
    g_dw = sp_dw.count('G') + sp_dw.count('g') 
    c_dw = sp_dw.count('C') + sp_dw.count('c') 
    try :
        skew_dw = (g_dw - c_dw) / float(g_dw + c_dw)
    except ZeroDivisionError:
        skew_dw = 0.0
    values_dw.append(skew_dw)
  elif strand == "-" :
    sp_dw = seq[loc - window : loc]
    g_dw = sp_dw.count('G') + sp_dw.count('g') 
    c_dw = sp_dw.count('C') + sp_dw.count('c')
    try :
        skew_dw = (c_dw - g_dw) / float(g_dw + c_dw)
    except ZeroDivisionError:
        skew_dw = 0.0
    values_dw.append(skew_dw)   

  return values_dw

As I said, I want to calculate the gcskews for 100bp of strands from the start site of genes.
Therefore, I made codes that get the chromosome name from the bed file and get the sequence data from the Fasta file.
Then according to gene name and strand information, I expected that codes will find the correct start site and gcskew for 100bp window will be calculated.
However, when I run this code, gcskew of - strand is wrong but + strand is correct. (I got correct gcskew data and I used it.)
Gcskews are different from the correct data, but I don't know what is the problem.
Could anyone tell me what is the problem of this code?
Thanks in advance!

window = 100
gname = []
up = []
dw = []

for match in content :
seq_chr = dic_file[str(match[3])]
if match[4] == "+" :
    strand = match[4]
    new = int(match[6])
    sen_up = GC_skew_up(strand, new, seq_chr, window = 100)
    sen_dw = GC_skew_dw(strand, new, seq_chr, window = 100)
    gname.append(match[2])
    up.append(str(sen_up[0]))
    dw.append(str(sen_dw[0]))

if match[4] == "-" :
    strand = match[4]
    new = int(match[6])
    an_up = GC_skew_up(strand, new, seq_chr, window = 100)
    an_dw = GC_skew_dw(strand, new, seq_chr, window = 100)
    gname.append(match[2])
    up.append(str(an_up[0]))
    dw.append(str(an_dw[0]))

tot = zip(gname, up, dw)

python functions fasta bed • 1.3k views

ADD COMMENT • link 4.7 years ago by H.Chloe ▴ 20

1

Entering edit mode

Since you're already using Bio, I'd encourage you not to reinvent the wheel, as there are already utilities for sliding window GC skew:

https://biopython.org/DIST/docs/api/Bio.SeqUtils-module.html#GC_skew

ADD REPLY • link 4.7 years ago by Joe 21k

0

Entering edit mode

Thank you very much for your answer and sorry for my mistake.
I wrote Bio in this code, but it is not used in this code.
And I think utilities for sliding window GC skew are really good, but I want to calculate the GC skew of genes.
Thanks for your kind advice again!

ADD REPLY • link 4.7 years ago by H.Chloe ▴ 20

0

Entering edit mode

Sorry, I’m not sure I follow. You can use the Bio tools for any DNA, it can be a gene or not, it doesn’t matter.

ADD REPLY • link 4.7 years ago by Joe 21k