How to merge several FPKM files?
0
0
Entering edit mode
5.3 years ago
HZZ0036 ▴ 30

Hi,

I have 8 transcriptome samples. After mapping transcriptome reads to genome and ran cufflinks, I get FPKM results like:

gene_id  trans_255FPKM
arahy.Tifrunner.gnm1.ann1.0002EG    0
arahy.Tifrunner.gnm1.ann1.0008XH    0.0693565
arahy.Tifrunner.gnm1.ann1.0046K4    0.173405
arahy.Tifrunner.gnm1.ann1.005S9P    0.0491059
arahy.Tifrunner.gnm1.ann1.0067NC    0.385907
arahy.Tifrunner.gnm1.ann1.006X4N    0.256161
arahy.Tifrunner.gnm1.ann1.008EVY    0
arahy.Tifrunner.gnm1.ann1.0093F6    0.282329
arahy.Tifrunner.gnm1.ann1.009U6B    0.564484
.......

gene_id  trans_256FPKM
arahy.Tifrunner.gnm1.ann1.0002EG    0
arahy.Tifrunner.gnm1.ann1.0008XH    0.0983234
arahy.Tifrunner.gnm1.ann1.004CIH    2.31641
arahy.Tifrunner.gnm1.ann1.005S9P    0.0493107
arahy.Tifrunner.gnm1.ann1.0067NC    0.457726
arahy.Tifrunner.gnm1.ann1.006X4N    0.36012
arahy.Tifrunner.gnm1.ann1.006XS9    0.709858
arahy.Tifrunner.gnm1.ann1.008EVY    0
arahy.Tifrunner.gnm1.ann1.0093F6    0.475885

.......
gene_id  trans_262FPKM

  ............

......... How to combine these files into one file? If there is no geneX, the FPKM will be 0. I want to get a file like this:

gene_id  trans_255FPKM    trans_256FPKM  ........ trans_262FPKM
arahy.Tifrunner.gnm1.ann1.0002EG    0     0        ............

Thanks in advance.

rna-seq alignment • 1.6k views
ADD COMMENT
0
Entering edit mode

What have you tried already? In which programming / scripting language are you most comfortable?

ADD REPLY
0
Entering edit mode

I have tried Python:

import glob
import pandas as pd
import numpy as np

# path = open(r'C:\Users\Desktop\test')
glob.glob("a*.xlsx")


all_data = pd.DataFrame()
for f in glob.glob("a2.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)

print (all_data.describe())

status = pd.read_excel("a1.xlsx")

all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()

print (all_data_st)

But the results are not correct. It only list the genes id in trans_256. Could you please tell me how to improve the script? Thanks.

ADD REPLY
0
Entering edit mode

The results like this:

                           gene_id  trans_256  trans_255
0  arahy.Tifrunner.gnm1.ann1.0002EG   0.000000   0.000000
1  arahy.Tifrunner.gnm1.ann1.0008XH   0.098323   0.069357
2  arahy.Tifrunner.gnm1.ann1.004CIH   2.316410        NaN
3  arahy.Tifrunner.gnm1.ann1.005S9P   0.049311   0.049106
4  arahy.Tifrunner.gnm1.ann1.0067NC   0.457726   0.385907
5  arahy.Tifrunner.gnm1.ann1.006X4N   0.360120   0.256161
6  arahy.Tifrunner.gnm1.ann1.006XS9   0.709858        NaN
7  arahy.Tifrunner.gnm1.ann1.008EVY   0.000000   0.000000
8  arahy.Tifrunner.gnm1.ann1.0093F6   0.475885   0.282329

Some genes id in trans_255 didn't show up.

ADD REPLY
0
Entering edit mode

There may just be no record of them in that sample, as you have implied. If you are confident that your Python script is doing the correct thing, then just convert the NaN values to 0 or just NA. The downstream program that you use may be able to tolerate NA values.

ADD REPLY
0
Entering edit mode

Both cufflinks and FPKM are NOT recommended anymore. What is the aim of your analysis?

ADD REPLY
0
Entering edit mode

Could you please tell me why? I need to get FPKM values to do gene co-expression analysis.

ADD REPLY
0
Entering edit mode

For co-expression analysis like, for example, WGCNA, then FPKM is okay. This is because WGCNA is based on correlation. I assume that you are aiming to use WGCNA (every person who starts in bioinformatics uses it, even though there are better tools available).

ADD REPLY

Login before adding your answer.

Traffic: 1839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6