Question: TPM to logFC and pvalues
0
gravatar for Spacebio
7 months ago by
Spacebio160
Spacebio160 wrote:

Hello,

I am new in this kind of analysis and I have a .csv file containing RNA-Seq data from different cell lines (with at least 3 replicates) normalised to TPM already, unfortunately I cannot access to the raw counts.

      Symbol       ID              C1    C2     C3      D1      D2            D3          D4 
1     TSPAN6 ENSG00000000003.13 133.95 132.07 64.47   54.85   53.65        47.87        56.37
2       TNMD  ENSG00000000005.5  10.39   3.47  1.11    0.58    1.74         0.36         1.68
3       DPM1 ENSG00000000419.11  67.67 124.98 33.02    8.35   12.95        12.31        13.33
4      SCYL3 ENSG00000000457.12   2.59   1.40  2.61    5.03    4.70         2.98         3.71
5   C1orf112 ENSG00000000460.15  12.32  46.18 16.49   19.54   19.20        11.72         8.55
6        FGR ENSG00000000938.11   0.00   0.00  0.04    0.36    0.08         0.00         0.00

So my question is: Is there a way I can follow to obtain the logFC, p-values, t-values and padj starting from this .csv file? I read about DESeq, DESeq2, EdgeR, limma and it looks like if all the R packages would ask for the raw counts. I would like to perform a Differential Expression Analysis.

Any suggestions about how to start? Any help is very appreciated.

dataanalysis rna-seq tpm R • 667 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by Spacebio160
1

you should not perform DEA on TPM, you need raw counts, you can a take look at tximport. It can generate the raw read counts. You can in a way get counts from the TPM file. Look at the two links. Link1 and Link2 . Or if you want to do it on your own you need to have effective length and length of the genes.

ADD REPLYlink written 7 months ago by vchris_ngs4.3k
1

Just to extend or vchris_ngs comment, here are two quotes on why it is a bad idea to perform differential expression on TPM. The first is from EBSeq github page:

In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. [...] Cross-sample TPM/FPKM/RPKM comparisons will be feasible only when no hypothetical DE genes present across samples (Or when assuming the DE genes are sort of 'symmetric' regarding up and down regulation).

The second is from Conesa et al. (2016):

TPMs, which effectively normalize for the differences in composition of the transcripts in the denominator rather than simply dividing by the number of reads in the library, are considered more comparable (than FPKM) between samples of different origins and composition but can still suffer some biases. These must be addressed with normalization techniques such as TMM.

ADD REPLYlink written 7 months ago by h.mon11k
3
gravatar for EagleEye
7 months ago by
EagleEye4.9k
Sweden
EagleEye4.9k wrote:

Hi, I assume you have to find differential expression between two cell lines (Cx and Dx groups). Since you need logFC and Pvalue, this R code can work. And you can use obtained matrix (mysample) to calculate FDR of your interest.

mysample <- read.table("./mymatrix.csv", sep=",", header=TRUE)

for(i in 2:nrow(mysample))
{

group1 <- mysample[i,grep('^C',names(mysample))]
group2 <- mysample[i,grep('^D',names(mysample))]

group1_avg <- sum(group1)/length(group1)
group2_avg <- sum(group2)/length(group2)

mysample[i, "C_avg"] <- group1_avg
mysample[i, "D_avg"] <- group2_avg

logfc <- abs(log2(group1_avg/group2_avg))


mysample[i, "LogFC"] <-  logfc


temp_data1 <- data.frame(values=c(group1,group2),vars = rep(c("C","D"), times = c(length(group1),length(group2))))
vars1 = c(rep(c("C","D"), times = c(length(group1),length(group2))))
values1=c(t(group1),t(group2))

x <- t.test(values1 ~ vars1, data = temp_data1)
pv <- c(x$p.value)

y <- wilcox.test(values1 ~ vars1, data = temp_data1)

wilcox <- c(y$p.value)

mysample[i, "TTEST"] <-  pv
mysample[i, "WilcoxTest"] <-  wilcox


}

write.table(mysample,"diffExp_res_TPM.txt", sep="\t", quote=FALSE, row.name=FALSE)
ADD COMMENTlink modified 7 months ago • written 7 months ago by EagleEye4.9k
1

t.test is extremely under-powered for RNA-seq Diff Exp and is not recommended. You may get no or very few DEGs by using t.test See Figure 4 https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

Figure 4

One would be better off by running diff Exp by proper s/w specialized for RNAseq (DESeq2, EdgeR), even if the data is normalized. They can model the count statistics much better than what a normal distribution can handle.

ADD REPLYlink modified 7 months ago • written 7 months ago by Santosh Anand3.4k

Hi,

True I also recommend using the specific statistical tools designed for RNAseq differential expression. But in case of OP only normalized expression values are available. I guess the possibilities here is to either TPM to be reverted to original raw counts to use available statistical approaches OR just use Log Fold change ranking.

ADD REPLYlink written 7 months ago by EagleEye4.9k

I get this error:

Error in FUN(X[[i]], ...) : 
  only defined on a data frame with all numeric variables

I think it has to do with the matrix containing the data, no?

ADD REPLYlink written 7 months ago by Spacebio160

I assume it might be because of 'NaN', 'Inf' values in the matrix. This can be rectified if you can use small matrix containing only numbers and check how the code works (I made the code as a quick fix).

ADD REPLYlink modified 7 months ago • written 7 months ago by EagleEye4.9k

The data contains 60575 gene values without 'NaN' but some '0.0000'. I thought it might be the 'Symbol' & 'ID' columns, should I delete those columns?

ADD REPLYlink written 7 months ago by Spacebio160

Try converting your file TAB-delimited from CSV and change the first line of code to,

mysample <- read.table("./mymatrix.tab", sep="\t", header=TRUE)
ADD REPLYlink written 7 months ago by EagleEye4.9k

Yes, it was the Symbol and ID column. I removed it and now it works perfectly!! Thank you so much for your help!! :-D

ADD REPLYlink written 7 months ago by Spacebio160

Great... Good luck :-)

ADD REPLYlink written 7 months ago by EagleEye4.9k

Dear @EagleEye, I am using your script because I asked kind of the same question. But cam you pls tell me what you meant with:

temp_data1 <- data.frame(values=c(group1,group2),vars = rep(c("C","D"), times = c(length(group1),length(group2))))
vars1 = c(rep(c("C","D"), times = c(length(group1),length(group2))))
values1=c(t(group1),t(group2))

because I get an error each time I run the script. Thank you in advance, V.

ADD REPLYlink written 6 months ago by Vania10

Unless you tell the exact error it is difficult to rectify.

ADD REPLYlink written 6 months ago by EagleEye4.9k

This error:

Error in data.frame(values = c(group1, group2), vars = rep(c("C", "D"), :  arguments imply differing number of rows: 54580, 7

and I just read Gordon Smyth wrote this comment and I was wondering if this approach you suggest is similar to the one he is proposing.

ADD REPLYlink written 6 months ago by Vania10

This a link to Gordon Smyth's answer about TPM data.

ADD REPLYlink modified 6 months ago • written 6 months ago by Spacebio160

Since we rely a lot on blogs rather than paper take a look at this blog. You should not use normalized counts for downstream tools, mostly they have internal normalization, reading blogs is good but read benchmarking papers as well. World of computational biology specially RNA-Seq is very confusing and a lot of people working on translational lab does not either understand most of the analytical flows owing to less knowledge of statistics/methodologies or they do not get support at work. So we rely on blogs but still, you should read the published tools/ their methods/ and mostly the papers that benchmark the quantification methods and the DE tools. TPM stats are fine for projection and visualization or when you are using them to reduce your gene sets for validation purpose or even clinical testing. But not for DE. Last 2 years there are plenty of DE benchmarking. Go through them. Sit with a statistician who works on method development. If not ask the author's of those tools directly. Most of them are kind enough to reply.

ADD REPLYlink modified 6 months ago • written 6 months ago by vchris_ngs4.3k
1

Adding to it, How not to perform a differential expression analysis

ADD REPLYlink modified 6 months ago • written 6 months ago by EagleEye4.9k

I was expecting someone to counter it with this Blog. Today's morning debate on Salmon and Kallisto is fun. I just want OP to understand it's about count data to perform DE analysis with standard tools and rather relying on blog better to read papers first. If at all there are errors while using tools blogs will be helpful but for conceptual flows it is better to read the papers, query with the developers and then here if no help is achieved.

ADD REPLYlink written 6 months ago by vchris_ngs4.3k
0
gravatar for mbk0asis
7 months ago by
mbk0asis290
Korea, Republic Of
mbk0asis290 wrote:

Try to multiply by 100 or larger numbers to make them integers.

ADD COMMENTlink written 7 months ago by mbk0asis290
1

how does that give your raw read counts?? It does not

ADD REPLYlink written 7 months ago by vchris_ngs4.3k
1

Multiply by 100. Is this kind of joke ? I can also ask you remove the numbers after the decimal point (Just a joke :-p).

ADD REPLYlink written 7 months ago by EagleEye4.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour