Question: Fold Change In R
0
gravatar for MarjoryMollusc
23 months ago by
Australia
MarjoryMollusc50 wrote:

Hi guys, Brand new to going through RNA-seq data, and learning on the go. I have been trying to make a volcano plot of some example data from GEO, to determine up-regulated genes in the sample. I am having problems with getting reasonable outputs for my fold-change values, with insanely high values over 5000 and below -10000. I am not sure if this is a result of just the data or something I have been doing myself. The reason for the volcano plot is so I can then determine the upregulated genes, usually done by looking at the genes over expressed (>=1).

I have been getting the average of the expression data for the "diabetic" and "non-diabetic". To determine the fold change I have then just subtracted the average of the diabetics from the average of the health. This is based off an example with a different dataset.

avg.t2d <- apply(exdat[,t2d.index==" diabetic"], 1, mean)
avg.healthy <- apply(exdat[,t2d.index==" non-diabetic"], 1, mean)
foldch <- avg.t2d - avg.healthy

logt2dp <- -log(t2dpvals, base=10)
plot(foldch, logt2dp, pch=1, xlab="Log2 Fold Change (L/S)",
     ylab="-Log10(P-val)", main="Volcano Plot", col = ifelse(adj_pvals<0.05, 'green', 'black'))

Am I missing something big here or am I over thinking the problem?

fold-change rna-seq R • 6.5k views
ADD COMMENTlink modified 23 months ago by seidel6.9k • written 23 months ago by MarjoryMollusc50
1

Use log2 for exploring the fold changes. You could do something like log2(avg.t2d)-log2(avg.healthy)

ADD REPLYlink written 23 months ago by kautilya400

Thanks for the reply! I have tried this already, however I had the warnings that NaNs were produced. Is there any way to cut the data that give that output? Or is that a result of errors somewhere else?

ADD REPLYlink modified 23 months ago • written 23 months ago by MarjoryMollusc50

The warning indicates that you have markers which have avg.t2d AND avg.healthy as 0. One way of handling this is adding a small pseudo count of eg. 0.0001 to all the mean values. This will ensure that there are no 0's in your data and NANs will be 0 foldch as expected.

ADD REPLYlink written 23 months ago by kautilya400

Thanks for explaining the error. Despite addin the pseudocount, it still came up with the error. Would this be because the 0's in the dataset would be in there as Na instead of 0? If so how would you change those Na's to 0's?

ADD REPLYlink written 23 months ago by MarjoryMollusc50

Can you provide some examples of the mean values for which you are getting this error?

ADD REPLYlink written 23 months ago by kautilya400

I disagree. Don't take logs after taking the mean, Take means after taking the logs, or take logs after taking the geometric means

ADD REPLYlink written 23 months ago by russhh5.0k
1

While seidel has addressed your actual question, I just wanted to take a step back and ask: have you considered using any of the established R packages for differential gene expression analysis, such as DESeq2, edgeR, limma? These packages will calculate the log-fold-changes for you (they'll even do it in a somewhat more sophisticated manner taking the quirks of RNA-seq data into account) and they will even offer you (adjusted) p-values to let you make an educated guess as to which genes might be worth looking at.

ADD REPLYlink written 23 months ago by Friederike5.2k

Yeah I have used limma later on. This exercise was for me to try an get a deeper understanding of what is actually going on behind the scenes when using those packages.

ADD REPLYlink written 23 months ago by MarjoryMollusc50

That's great! If you want to know about the nitty-gritty details, the introduction of the limma-voom paper mentions a lot of the statistical details that are applied in those packages (which is the reason why simply taking means and log-fold changes won't give you the same results)

ADD REPLYlink written 23 months ago by Friederike5.2k
          NA         NA.1         NA.2         NA.3         NA.4 
9.861355e+03 2.121977e+04 2.202288e+00 1.079073e+01 1.132874e+00 
        NA.5 
1.479524e-01

That is the head of avg.t2d With the pseudocount:

> head(avg.t2dps)
          NA         NA.1         NA.2         NA.3         NA.4 
9.861356e+03 2.121977e+04 2.203288e+00 1.079173e+01 1.133874e+00 
        NA.5 
1.489524e-01
ADD REPLYlink written 23 months ago by MarjoryMollusc50
2
gravatar for seidel
23 months ago by
seidel6.9k
United States
seidel6.9k wrote:

You have to know what's going on with your data. It looks like it's already in log format because you're taking the ratio as: foldch <- avg.t2d - avg.healthy

A few things to try:

# are there NA values in the original data?
sum( is.na(avg.t2d) )
sum( is.na(avg.healthy) )

# are there other problematic values in the original data?
sum(is.finite(avg.t2d))
sum(is.finite(avg.healthy))

# for two vectors, examine their comparability
complete.cases(avg.t2d, avg.healthy)

# this also works for rows of a matrix
# (first use cbind to combine your vectors as matrix coloumns)
x <- cbind(avg.t2d, avg.healthy)
complete.cases(x)

# examine cases where the value is not defined for both variables
x[!complete.cases(x),]

# omit cases where the value is not defined for both variables
x <- x[complete.cases(x),]

# you can also use na.omit for this
foo <- na.omit(cbind(avg.t2d, avg.healthy))

# converting to data frame allows you to use your column names again
foo <- as.data.frame(foo)

# gene expression ratio
foldch <- foo$avg.t2d - foo$avg.healthy
ADD COMMENTlink modified 23 months ago • written 23 months ago by seidel6.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 853 users visited in the last hour