Get coordinates for a list of genes

3.1 years ago

robinycfang ▴ 20

I have a table with genes and their fold change (FC):

gene_id          TPM_n  TPM_t  FC           SYMBOL
ENSG00000000003  17.01  12.22  0.734036646  TSPAN6
ENSG00000000419  9.37   19.25  1.952748312  DPM1

I also have the GTF annotation file. I want output like the following:

chromosome  start    end      gene    FC
1           1659746  1661263  TSPAN6  0.734036646

etc...

Tried to write a python script, but too many iterations...too slow. Any idea? Thanks!

RNA-Seq gene • 669 views

ADD COMMENT • link updated 3.1 years ago by zx8754 11k • written 3.1 years ago by robinycfang ▴ 20

Try biomaRt in R or bioservices in Python.

ADD REPLY • link 3.1 years ago by cpad0112 21k

Tried to write a python script, but too many iterations...too slow.

If you show us the code we might be able to help.

ADD REPLY • link 3.1 years ago by WouterDeCoster 47k

import pandas as pd
import re

diff_df = pd.read_csv('DiffGenes.csv')
clean = ''

with open('gencode.v32.annotation.gtf', 'r') as gtf:
    line = True
    while line:
        line = gtf.readline()
        if line.startswith('#') is False:
            gene = re.findall('gene_name \"(.*?)\";', line)[0]
            try:
                FC = diff_df[diff_df['SYMBOL'] == gene].iloc[0, 3]
                chrom = 'hs' + line.split('\t')[0]
                start = line.split('\t')[3]
                end = line.split('\t')[4]
                clean += chrom + '\t' + start + '\t' + end + '\t' + str(FC) + '\n'
            except IndexError:
                continue

with open('fc_circos.txt', 'w') as w:
    w.write(clean)

This is my super slow Python code.

ADD REPLY • link 3.1 years ago by robinycfang ▴ 20
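
A likely reason the script above is slow: for every line of the GTF (transcripts and exons included, not just genes) it scans the whole DataFrame for a matching SYMBOL, and it builds the output by repeated string concatenation. Below is a minimal sketch of one alternative, not a definitive implementation: read the GTF once, keep only `gene` records in a dictionary keyed by gene name, then look up each row of the table. The function names `gene_coordinates`, `annotate` and `write_table` are hypothetical. The sketch assumes that `DiffGenes.csv` is comma-separated with `SYMBOL` and `FC` columns, and that each `gene` record carries a `gene_name` attribute, as GENCODE GTF files do. Chromosome names are passed through as they appear in the GTF.

```python
import csv
import re

# Matches the gene_name attribute in the ninth GTF column.
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_coordinates(gtf_lines):
    """Return {gene_name: (chrom, start, end)} built from 'gene' records only."""
    coords = {}
    for line in gtf_lines:
        if line.startswith('#'):
            continue  # header/comment line
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'gene':
            continue  # skip transcripts, exons, CDS, malformed lines
        match = GENE_NAME.search(fields[8])
        if match:
            # Assumption: if a symbol occurs more than once, keep the first record.
            coords.setdefault(match.group(1), (fields[0], fields[3], fields[4]))
    return coords


def annotate(table_rows, coords):
    """Yield (chrom, start, end, symbol, FC) for rows whose SYMBOL was found."""
    for row in table_rows:
        hit = coords.get(row['SYMBOL'])  # constant-time dictionary lookup
        if hit is not None:
            chrom, start, end = hit
            yield chrom, start, end, row['SYMBOL'], row['FC']


def write_table(gtf_path, table_path, out_path):
    """Read both inputs once and write a tab-separated result."""
    with open(gtf_path) as gtf:
        coords = gene_coordinates(gtf)
    with open(table_path, newline='') as table, open(out_path, 'w') as out:
        for record in annotate(csv.DictReader(table), coords):
            out.write('\t'.join(record) + '\n')
```

With the file names from the script above, this could be called as `write_table('gencode.v32.annotation.gtf', 'DiffGenes.csv', 'fc_circos.txt')`. Genes whose symbol is missing from the GTF are silently skipped, which mirrors the `except IndexError: continue` behaviour of the original.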
