Get coordinates for a list of genes
10 months ago
robinycfang ▴ 10

I have a table with genes and their fold change (FC):

gene_id          TPM_n  TPM_t  FC           SYMBOL
ENSG00000000003  17.01  12.22  0.734036646  TSPAN6
ENSG00000000419  9.37   19.25  1.952748312  DPM1

and a GTF annotation file. I want the following:

chromosome  start    end      gene    FC
1           1659746  1661263  TSPAN6  0.734036646

etc...

I tried to write a Python script, but it needs too many iterations and is too slow. Any ideas? Thanks!

RNA-Seq gene • 249 views

Try biomaRt in R or bioservices in Python.


Tried to write a python script, but too many iterations...too slow.

If you show us the code we might be able to help.

import re

import pandas as pd

diff_df = pd.read_csv('DiffGenes.csv')
clean = ''

with open('gencode.v32.annotation.gtf', 'r') as gtf:
    for line in gtf:
        # Skip the header comment lines at the top of the GTF.
        if line.startswith('#'):
            continue
        # Skip any line without a gene_name attribute instead of crashing.
        names = re.findall(r'gene_name "(.*?)";', line)
        if not names:
            continue
        gene = names[0]
        try:
            # Look up the fold change (column 3) for this gene symbol;
            # iloc raises IndexError when the gene is not in the table.
            FC = diff_df[diff_df['SYMBOL'] == gene].iloc[0, 3]
        except IndexError:
            continue
        fields = line.split('\t')
        chrom = 'hs' + fields[0]
        start = fields[3]
        end = fields[4]
        clean += chrom + '\t' + start + '\t' + end + '\t' + str(FC) + '\n'

with open('fc_circos.txt', 'w') as w:
    w.write(clean)

This is my super slow Python code.
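
One likely reason for the slowness is that the whole table is filtered once per GTF line, and a GTF has a line for every transcript and exon, not just every gene. A minimal sketch of an alternative, assuming the fold changes fit in memory: read the table once into a dict keyed by SYMBOL, then walk the GTF a single time and keep only rows whose feature type (third column) is `gene`. The function names `load_fold_changes` and `gene_coordinates` are made up for this example, and the toy GTF coordinates below are placeholders, not real annotation values.

```python
import csv
import io
import re

# Matches the gene_name attribute in the 9th GTF column.
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def load_fold_changes(handle, symbol_col="SYMBOL", fc_col="FC"):
    """Return a dict mapping gene symbol -> fold change (as text)."""
    return {row[symbol_col]: row[fc_col] for row in csv.DictReader(handle)}


def gene_coordinates(gtf_lines, fold_changes):
    """Yield (chrom, start, end, gene, fc) for 'gene' rows with a known FC."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header comments
        fields = line.rstrip("\n").split("\t")
        # Keep only gene-level rows; transcripts, exons, etc. are skipped.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME_RE.search(fields[8])
        if match and match.group(1) in fold_changes:
            gene = match.group(1)
            yield fields[0], fields[3], fields[4], gene, fold_changes[gene]


# Toy inputs: the table rows come from the question, the GTF rows are
# invented placeholders that only mimic the GENCODE column layout.
table = io.StringIO(
    "gene_id,TPM_n,TPM_t,FC,SYMBOL\n"
    "ENSG00000000003,17.01,12.22,0.734036646,TSPAN6\n"
    "ENSG00000000419,9.37,19.25,1.952748312,DPM1\n"
)
gtf = [
    "##description: toy example\n",
    'chrX\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "G1"; gene_name "TSPAN6";\n',
    'chrX\tHAVANA\texon\t100\t150\t.\t-\t.\tgene_id "G1"; gene_name "TSPAN6";\n',
    'chr20\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_name "DPM1";\n',
    'chr1\tHAVANA\tgene\t500\t600\t.\t+\t.\tgene_id "G3"; gene_name "OTHER";\n',
]

fold_changes = load_fold_changes(table)
rows = list(gene_coordinates(gtf, fold_changes))
for row in rows:
    print("\t".join(row))
```

With the real files, the same two functions could be fed `open('DiffGenes.csv', newline='')` and `open('gencode.v32.annotation.gtf')`, writing each joined row to `fc_circos.txt`. Each GTF line then costs one dict lookup instead of a scan of the table. If the output is meant for Circos, the chromosome names may still need converting (GENCODE uses names like `chr1`).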