Question

TPM to logFC and pvalues

0

Entering edit mode

6.8 years ago

Spacebio ▴ 200

Hello,

I am new in this kind of analysis and I have a .csv file containing RNA-Seq data from different cell lines (with at least 3 replicates) normalised to TPM already, unfortunately I cannot access to the raw counts.

      Symbol       ID              C1    C2     C3      D1      D2            D3          D4 
1     TSPAN6 ENSG00000000003.13 133.95 132.07 64.47   54.85   53.65        47.87        56.37
2       TNMD  ENSG00000000005.5  10.39   3.47  1.11    0.58    1.74         0.36         1.68
3       DPM1 ENSG00000000419.11  67.67 124.98 33.02    8.35   12.95        12.31        13.33
4      SCYL3 ENSG00000000457.12   2.59   1.40  2.61    5.03    4.70         2.98         3.71
5   C1orf112 ENSG00000000460.15  12.32  46.18 16.49   19.54   19.20        11.72         8.55
6        FGR ENSG00000000938.11   0.00   0.00  0.04    0.36    0.08         0.00         0.00

So my question is: Is there a way I can follow to obtain the logFC, p-values, t-values and padj starting from this .csv file? I read about DESeq, DESeq2, EdgeR, limma and it looks like if all the R packages would ask for the raw counts. I would like to perform a Differential Expression Analysis.

Any suggestions about how to start? Any help is very appreciated.

rna-seq R tpm DataAnalysis • 11k views

ADD COMMENT • link updated 2.7 years ago by kuttibiotech2009 ▴ 30 • written 6.8 years ago by Spacebio ▴ 200

1

Entering edit mode

you should not perform DEA on TPM, you need raw counts, you can a take look at tximport. It can generate the raw read counts. You can in a way get counts from the TPM file. Look at the two links. Link1 and Link2 . Or if you want to do it on your own you need to have effective length and length of the genes.

ADD REPLY • link 6.8 years ago by ivivek_ngs ★ 5.2k

1

Entering edit mode

Just to extend or vchris_ngs comment, here are two quotes on why it is a bad idea to perform differential expression on TPM. The first is from EBSeq github page:

In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. [...] Cross-sample TPM/FPKM/RPKM comparisons will be feasible only when no hypothetical DE genes present across samples (Or when assuming the DE genes are sort of 'symmetric' regarding up and down regulation).

The second is from Conesa et al. (2016):

TPMs, which effectively normalize for the differences in composition of the transcripts in the denominator rather than simply dividing by the number of reads in the library, are considered more comparable (than FPKM) between samples of different origins and composition but can still suffer some biases. These must be addressed with normalization techniques such as TMM.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

6.8 years ago

mbk0asis ▴ 680

Try to multiply by 100 or larger numbers to make them integers.

ADD COMMENT • link 6.8 years ago by mbk0asis ▴ 680

1

Entering edit mode

how does that give your raw read counts?? It does not

ADD REPLY • link 6.8 years ago by ivivek_ngs ★ 5.2k

1

Entering edit mode

Multiply by 100. Is this kind of joke ? I can also ask you remove the numbers after the decimal point (Just a joke :-p).

ADD REPLY • link 6.8 years ago by EagleEye 7.5k

score 4 · Accepted Answer · 2017-07-27

4

Entering edit mode

6.8 years ago

EagleEye 7.5k

Hi, I assume you have to find differential expression between two cell lines (Cx and Dx groups). Since you need logFC and Pvalue, this R code can work. And you can use obtained matrix (mysample) to calculate FDR of your interest.

mysample <- read.table("./mymatrix.csv", sep=",", header=TRUE)

for(i in 2:nrow(mysample))
{

group1 <- mysample[i,grep('^C',names(mysample))]
group2 <- mysample[i,grep('^D',names(mysample))]

group1_avg <- sum(group1)/length(group1)
group2_avg <- sum(group2)/length(group2)

mysample[i, "C_avg"] <- group1_avg
mysample[i, "D_avg"] <- group2_avg

logfc <- abs(log2(group1_avg/group2_avg))


mysample[i, "LogFC"] <-  logfc


temp_data1 <- data.frame(values=c(group1,group2),vars = rep(c("C","D"), times = c(length(group1),length(group2))))
vars1 = c(rep(c("C","D"), times = c(length(group1),length(group2))))
values1=c(t(group1),t(group2))

x <- t.test(values1 ~ vars1, data = temp_data1)
pv <- c(x$p.value)

y <- wilcox.test(values1 ~ vars1, data = temp_data1)

wilcox <- c(y$p.value)

mysample[i, "TTEST"] <-  pv
mysample[i, "WilcoxTest"] <-  wilcox


}

write.table(mysample,"diffExp_res_TPM.txt", sep="\t", quote=FALSE, row.name=FALSE)

ADD COMMENT • link 6.8 years ago by EagleEye 7.5k

1

Entering edit mode

t.test is extremely under-powered for RNA-seq Diff Exp and is not recommended. You may get no or very few DEGs by using t.test See Figure 4 https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

One would be better off by running diff Exp by proper s/w specialized for RNAseq (DESeq2, EdgeR), even if the data is normalized. They can model the count statistics much better than what a normal distribution can handle.

ADD REPLY • link 6.7 years ago by Santosh Anand 5.7k

0

Entering edit mode

Hi,

True I also recommend using the specific statistical tools designed for RNAseq differential expression. But in case of OP only normalized expression values are available. I guess the possibilities here is to either TPM to be reverted to original raw counts to use available statistical approaches OR just use Log Fold change ranking.

ADD REPLY • link 6.7 years ago by EagleEye 7.5k

0

Entering edit mode

I get this error:

Error in FUN(X[[i]], ...) : 
  only defined on a data frame with all numeric variables

I think it has to do with the matrix containing the data, no?

ADD REPLY • link 6.8 years ago by Spacebio ▴ 200

0

Entering edit mode

I assume it might be because of 'NaN', 'Inf' values in the matrix. This can be rectified if you can use small matrix containing only numbers and check how the code works (I made the code as a quick fix).

ADD REPLY • link 6.8 years ago by EagleEye 7.5k

0

Entering edit mode

The data contains 60575 gene values without 'NaN' but some '0.0000'. I thought it might be the 'Symbol' & 'ID' columns, should I delete those columns?

ADD REPLY • link 6.8 years ago by Spacebio ▴ 200

0

Entering edit mode

Try converting your file TAB-delimited from CSV and change the first line of code to,

mysample <- read.table("./mymatrix.tab", sep="\t", header=TRUE)

ADD REPLY • link 6.8 years ago by EagleEye 7.5k

0

Entering edit mode

Yes, it was the Symbol and ID column. I removed it and now it works perfectly!! Thank you so much for your help!! :-D

ADD REPLY • link 6.8 years ago by Spacebio ▴ 200

0

Entering edit mode

Great... Good luck :-)

ADD REPLY • link 6.8 years ago by EagleEye 7.5k

0

Entering edit mode

The same error i am getting what should i do.. please help

ADD REPLY • link 2.7 years ago by kuttibiotech2009 ▴ 30

0

Entering edit mode

Dear @EagleEye, I am using your script because I asked kind of the same question. But cam you pls tell me what you meant with:

temp_data1 <- data.frame(values=c(group1,group2),vars = rep(c("C","D"), times = c(length(group1),length(group2))))
vars1 = c(rep(c("C","D"), times = c(length(group1),length(group2))))
values1=c(t(group1),t(group2))

because I get an error each time I run the script. Thank you in advance, V.

ADD REPLY • link 6.7 years ago by Vania ▴ 10

0

Entering edit mode

Unless you tell the exact error it is difficult to rectify.

ADD REPLY • link 6.7 years ago by EagleEye 7.5k

0

Entering edit mode

This error:

Error in data.frame(values = c(group1, group2), vars = rep(c("C", "D"), :  arguments imply differing number of rows: 54580, 7

and I just read Gordon Smyth wrote this comment and I was wondering if this approach you suggest is similar to the one he is proposing.

ADD REPLY • link 6.7 years ago by Vania ▴ 10

0

Entering edit mode

This a link to Gordon Smyth's answer about TPM data.

ADD REPLY • link 6.7 years ago by Spacebio ▴ 200

0

Entering edit mode

Since we rely a lot on blogs rather than paper take a look at this blog. You should not use normalized counts for downstream tools, mostly they have internal normalization, reading blogs is good but read benchmarking papers as well. World of computational biology specially RNA-Seq is very confusing and a lot of people working on translational lab does not either understand most of the analytical flows owing to less knowledge of statistics/methodologies or they do not get support at work. So we rely on blogs but still, you should read the published tools/ their methods/ and mostly the papers that benchmark the quantification methods and the DE tools. TPM stats are fine for projection and visualization or when you are using them to reduce your gene sets for validation purpose or even clinical testing. But not for DE. Last 2 years there are plenty of DE benchmarking. Go through them. Sit with a statistician who works on method development. If not ask the author's of those tools directly. Most of them are kind enough to reply.

ADD REPLY • link 6.7 years ago by ivivek_ngs ★ 5.2k

1

Entering edit mode

Adding to it, How not to perform a differential expression analysis

ADD REPLY • link 6.7 years ago by EagleEye 7.5k

0

Entering edit mode

I was expecting someone to counter it with this Blog. Today's morning debate on Salmon and Kallisto is fun. I just want OP to understand it's about count data to perform DE analysis with standard tools and rather relying on blog better to read papers first. If at all there are errors while using tools blogs will be helpful but for conceptual flows it is better to read the papers, query with the developers and then here if no help is achieved.

ADD REPLY • link 6.7 years ago by ivivek_ngs ★ 5.2k