heatmap using FPKM values
1
1
Entering edit mode
5.8 years ago
ashish ▴ 670

I have FPKM values for a number of genes. I want to make a heatmap and find differentialy expressed genes. Here I have three questions:

1. Should I use FPKM values directly to make a heatmap or is there anything which has to be done to FPKM values before using them to make a heatmap? I am using MeV to make a heatmap.
2. How do I find differentially expressed genes here? I read that taking log2 values of FPKM and then setting the cutoff as 2 will give differentially expressed gene. Taking the cutoff as 2 sounds very arbitrary, is there any reason to take 2 as the cutoff?
3. There are biological replicates also, each replicate has a specific FPKM value under each treatment. How do I treat these replicates?

Thanks

FPKM RNA-Seq • 8.3k views
1
Entering edit mode

The "workflow" you describe for finding differentially expressed genes is not appropriate/optimal. A good starting point is the workflow from bioconductor for RNA-seq analysis, although it doesn't start from fpkm values.

1
Entering edit mode

Thats another problem. Most tools take count matrices as input.

1
Entering edit mode

0
Entering edit mode

yes no access to raw data. What best can i do with the FPKM values.

1
Entering edit mode

This sounds all to familiar like a class homework exercise.

0
Entering edit mode

Dear Kevin,

I'm a bit confused and hoping you can clear things up for me.

I was trying to make a heatmap of log2-transformed FPKM values but the plots do not make sense to me as genes do not change with conditions and biological replicates do not cluster (I can share code and data if you want). Is this because log2-transformed FPKM values are not comparable between samples? https://ibb.co/m5oLwG http://tinypic.com/r/acdfea/9

To give some background I'm making 3 heatmaps, one each for a different cancer cell line. For each of these cell lines they were exposed to multiple treatments: various chemicals (n=5) compared to the control (DMSO) as well as a number of shRNA (n=4) compared to the control (shCTR).

Users like ashish and WouterDeCoster are not being very helpful by suggesting another workflow (e.g. DESeq2). To make my point clear I just started a new job, and the sequencing facility has given me cuffdiff results so that is what i have to work with!

I know DESeq2 (and other methods which use the generalized linear model) are appropriate for more complex experimental designs as they can deal with multiple factors (explanatory variables) – e.g. experiments involving multiple treatments (my experimental design) and time-points (time course analysis). I told my boss this. The sequencing facility in the future will give me STAR alignments so I can do DESeq2 in the future (i.e. next project). However, currently all I have is these FPKM values and I need to make some figures which is holding back publication of a manuscript of one of the post-docs in our lab.

Okay so to clear things up if I feed my FPKM values into Limma it will "fit a linear model to your data and adjusts using empirical bayes" which would allow me to make some heatmaps and see replicates clustering and differential expression between samples?

Many thanks for the help.

1
Entering edit mode

Hi moldach,

Yes, I have seen heatmaps like yours in the past - the data distribution does not look optimal, but this is related to a problem with FPKM distribution, generally. As far as I understand, you took the significant genes from CuffDiff and then generated the heatmap from these?

Could you just try 2 more heatmaps with the following code (and note to use unlogged FPKM values):

#Set colour
library("RColorBrewer")
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-3, 3, length.out=101)

library("gplots")

#Scale the FPKM values to the Z scale
library(zFPKM)
heat <- zFPKM(MyUnLoggedFPKMValues)

#Select out the significant genes from CuffDiff
heat <- heat[which(rownames(heat) %in% MySignificantGenes), ]

pdf("Heatmaps.pdf")
par(mar=c(2,2,2,2), cex=0.8)

#Euclidean distance
heatmap.2(heat,
col=myCol,
breaks=myBreaks,
main="Title",
key=T,
keysize=1.0,
scale="none",
ColSideColors=MySampleColorVector,
density.info="none",
reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
trace="none",
cexRow=0.3,
labCol=NA,
cexCol=0.6,
distfun=function(x) dist(x, method="euclidean"),
hclustfun=function(x) hclust(x, method="ward.D2"))

#1-cor distance
heatmap.2(heat,
col=myCol,
breaks=myBreaks,
main="Title",
key=T,
keysize=1.0,
scale="none",
ColSideColors=MySampleColorVector,
density.info="none",
reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
trace="none",
cexRow=0.3,
labCol=NA,
cexCol=0.6,
distfun=function(x) as.dist(1-cor(t(x))),
hclustfun=function(x) hclust(x, method="ward.D2"))
dev.off()

0
Entering edit mode

I will certainly give this a try Kevin, and thank you, but first I need some clarification on where to get MySignificantGenes from. Is this found in the gene_exp.diff file output from CuffDiff? If so do I just filter by the column significant?

I guess the part that confuses me is that there are a lot of comparisons comparisons going on (e.g. DMSO_1h vs. DMSO_6h, DMSO_1h vs. JQ1_1h, DMSO_1h vs JQ1_6h, DMSO_1h vs. JQMT_1h, DMSO_1h vs. JQMT_6h, DMSO_1h vs. MTX_1h, DMSO_6h vs. JQ1_1h, shCTR vs. shBRD4, shBRD4 vs. shMTHFD1, etc.).

My concern is that since CuffDiff can only do pairwise comparisons perhaps this is somehow not meaningful to look at?

Assuming I do use the gene_exp.diff file (6Gb) this is the following code I've run before I get some weird result:

# Since file is too large for R I do some pre-processing first in linux
cut -f 3,14 gene_exp.diff > edit-degs.diff  # Cut the *gene* and *significant* columns
awk '$2 ~ /yes/ {print$0}' edit-degs.diff > degs.csv  # Filter by significant genes
cut -f 1 degs.csv > degs.csv   # Cut out just the gene names
sed -i 's/-/NA/g' degs.csv     # Turn missing gene names to NA

# Now process *MySignificantGenes* in R
sig <- na.omit(sig)  # Omit NAs
sig <- gsub(",.*", "", sig[,1])   # Since some genes have multiple names (e.g. C16orf45,MPV17L,RP11NA1021N1.1) remove the *,* and everything after it
sig <- as.data.frame(sig) # Since the last command turned it into *character* convert back to data.frame
sig <- unique(sig) # There are many duplicate key values (gene names) so filter them out
nrow(sig)
[1] 20244


So when I get to the heat <- t(scale(t(MyUnLoggedFPKMValues))) my nrow(heat) is [1] 55939. However, I'm getting weird results here:

heat2 <- heat[which(rownames(heat) %in% sig), ]
nrow(heat2)
[1] 0
K562_DMSO_1h_FPKM K562_DMSO_6h_FPKM K562_JQ1_1h_FPKM K562_JQ1_6h_FPKM K562_JQMT_1h_FPKM K562_JQMT_6h_FPKM K562_MTX_1h_FPKM
K562_MTX_6h_FPKM K562_dBet6_1h_FPKM K562_dBet6_6h_FPKM K562_shBRD4_FPKM K562_shMTHBRD_FPKM K562_shMTHFD1_FPKM
K562_shPLKO_FPKM


I compared gene names from sig and from heat to confirm that the genes are present from sig in heat. Not sure why I'm getting this error?

0
Entering edit mode

Yes, but it would be extremely difficult to extract just a single P value for each gene. That is, you have many conditions and time points and would require a more complex form of modelling, coupled with a very large sample size. Cuffdiff is mainly for comparisons between just two conditions.

In your situation, take the top 10 or 20 genes from each comparison, merge these into a vector and remove any duplicate genes that may have been significant in more than one comparison, and then use this list of significant genes to subset your FPKM matrix and to generate your heatmap (with FPKM values that you'll scale yourself using my code above). You should then see good separation. I would not drift away from the pairwise comparisons that have already been done - it may very well be the best way to analyse the data that you've got.

PS - if not implied in the CuffDiff files, it would be beneficial to find out if the P values have been adjusted by FDR!

0
Entering edit mode

Yes, but your experimental set-up is complex and it would be extremely difficult to extract just a single P value for each gene. That is, you have many conditions and time points and would require a more complex form of modelling, couple with a very large sample size. Cuffdiff is mainly for comparisons between just two conditions.

Yes this is pretty much what I was thinking :(

Please note that I modified my comment above sometime ~during your answer to show what I've worked through so far on the code.

take the top 10 or 20 genes from each comparison, merge thees into a vector and remove any duplicate genes that may have been significant in more than one comparison, and then use this list of significant genes to subset your FPKM matrix and to generate your heatmap

Your suggestion seems reasonable but a bit tricky for me to implement - I'll give it a try though. Most of the data.frame manipulation I've done to-date has been on small files in R (using dplyr for example) so I'm having a bit of difficulty trying to figure out how to do this in linux.

There are 5 chemical treatments (each with two time periods) and 4 shRNA treatments. So I did the pairwise comparison calculations for how many I need to do for each of the cell lines. PCs=[k(k-1)]/2. So [1413]/2=91

Let me think this out through pseudo-code now: 1) cut -f 3,5,6,10,14 gene_exp.diff # filter out gene, sample1, sample2, log2(fold-change), and significant columns

2) awk '$5 ~ /yes/ {print$0}' edit-degs.diff > degs.csv # Filter out only those genes which were significant in any pairwise comparison

3) Create 51 smaller datasets (one for each pairwise comparison) - I'm sure there has to be a more elegant way to do this?

awk $2 ~ /DMSO_1h/ &&$3 ~/MTX_6h/ {print \$0}' degs.csv > pc1.csv # Use multiple column matching with awk for each of the comparisons


4) Sort each of these datasets by log2(fold-change) (or FDR p-value) and grab the top 20.

5) Merge these into a vector v1 <- merge(pc1, pc2) than v2 <- merge(v1, pc3), etc. 51 times or use plyr v1 <- join_all(pc1, pc2, ...., pc51)

Oh yeah and about the pairwise comparisons that have already been done well that's just the thing...I don't know/think they've been done yet. I certainly can't access them from the .html and can't see them in SCRATCH folder (just the cuffdiff_global folder). I guess that's a question I'll have to ask my sequencing facility.

0
Entering edit mode

the sequencing facility has given me cuffdiff results so that is what i have to work with

If you already have the Cuffdiff results, then you already have differentially expressed genes.

0
Entering edit mode

To clarify things I used FPKM values from CuffLinks output not CuffDiff. Here is an image from the sequencing facility: http://i67.tinypic.com/qpj86u.jpg Specifically I used Genes (Symbols) which is the Gene FPKM file just with ensembl_gene_names` added.

The CuffDiff results seen here http://i65.tinypic.com/2ic0jle.png are from 816 pair-wise comparisons. The first week I started here none of these links were clickable and I didn't have access to the cluster to look at files in SCRATCH. Now that I have cluster access I can only see a cuffdiff_global folder http://i67.tinypic.com/ipun7s.png but no individual pairwise comparison results. Should I be using the genes.fpkm_tracking file instead?

4
Entering edit mode
5.2 years ago

Dear Ashish,

For your situation, I would not use MeV.

I recommend reading your data into R and then transforming to Z scale via zFPKM. You may then use standard statistical tests on these Z-scores for the purpose of identifying statistically significant genes.

You may take a look at this thread where I give some information on producing heatmaps in R: How to cluster the upregulated and downregulated genes in heatmap?

Also, here is a more advanced version of a heatmap in R: How to plot a heatmap with two different distance matrices for X and Y

Kevin