Question: heatmap using FPKM values
1
gravatar for ashish
10 months ago by
ashish170
ashish170 wrote:

I have FPKM values for a number of genes. I want to make a heatmap and find differentialy expressed genes. Here I have three questions:

  1. Should I use FPKM values directly to make a heatmap or is there anything which has to be done to FPKM values before using them to make a heatmap? I am using MeV to make a heatmap.
  2. How do I find differentially expressed genes here? I read that taking log2 values of FPKM and then setting the cutoff as 2 will give differentially expressed gene. Taking the cutoff as 2 sounds very arbitrary, is there any reason to take 2 as the cutoff?
  3. There are biological replicates also, each replicate has a specific FPKM value under each treatment. How do I treat these replicates?

Thanks

rna-seq fpkm • 1.1k views
ADD COMMENTlink modified 3 months ago by moldach50 • written 10 months ago by ashish170
1

The "workflow" you describe for finding differentially expressed genes is not appropriate/optimal. A good starting point is the workflow from bioconductor for RNA-seq analysis, although it doesn't start from fpkm values.

ADD REPLYlink written 10 months ago by WouterDeCoster24k
1

Thats another problem. Most tools take count matrices as input.

ADD REPLYlink written 10 months ago by ashish170
1

So you only have FPKMs? No access to raw data?

ADD REPLYlink written 10 months ago by WouterDeCoster24k

yes no access to raw data. What best can i do with the FPKM values.

ADD REPLYlink written 10 months ago by ashish170
1

This sounds all to familiar like a class homework exercise.

ADD REPLYlink written 3 months ago by theobroma221.1k

Dear Kevin,

I'm a bit confused and hoping you can clear things up for me.

I was trying to make a heatmap of log2-transformed FPKM values but the plots do not make sense to me as genes do not change with conditions and biological replicates do not cluster (I can share code and data if you want). Is this because log2-transformed FPKM values are not comparable between samples? https://ibb.co/m5oLwG http://tinypic.com/r/acdfea/9

To give some background I'm making 3 heatmaps, one each for a different cancer cell line. For each of these cell lines they were exposed to multiple treatments: various chemicals (n=5) compared to the control (DMSO) as well as a number of shRNA (n=4) compared to the control (shCTR).

Users like ashish and WouterDeCoster are not being very helpful by suggesting another workflow (e.g. DESeq2). To make my point clear I just started a new job, and the sequencing facility has given me cuffdiff results so that is what i have to work with!

I know DESeq2 (and other methods which use the generalized linear model) are appropriate for more complex experimental designs as they can deal with multiple factors (explanatory variables) – e.g. experiments involving multiple treatments (my experimental design) and time-points (time course analysis). I told my boss this. The sequencing facility in the future will give me STAR alignments so I can do DESeq2 in the future (i.e. next project). However, currently all I have is these FPKM values and I need to make some figures which is holding back publication of a manuscript of one of the post-docs in our lab.

Okay so to clear things up if I feed my FPKM values into Limma it will "fit a linear model to your data and adjusts using empirical bayes" which would allow me to make some heatmaps and see replicates clustering and differential expression between samples?

Many thanks for the help.

ADD REPLYlink modified 3 months ago • written 3 months ago by moldach50
1

Hi moldach,

Yes, I have seen heatmaps like yours in the past - the data distribution does not look optimal, but this is related to a problem with FPKM normalisation generally. As far as I understand, you took the significant genes from CuffDiff and then generated the heatmap from these?

Before using Limma, could you just try 2 more heatmps with the following code (and note to use unlogged FPKM values for this particular situation - we transform the data another way):

#Set colour
require("RColorBrewer")
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-3, 3, length.out=101)

require("gplots")

#Scale the FPKM values to the Z scale
heat <- t(scale(t(MyUnLoggedFPKMValues)))

#Select out the significant genes from CuffDiff
heat <- heat[which(rownames(heat) %in% MySignificantGenes), ]

pdf("Heatmaps.pdf")
    par(mar=c(2,2,2,2), cex=0.8)

    #Euclidean distance
    heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", ColSideColors=MySampleColorVector, density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.3, labCol=NA, cexCol=0.6, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))
    legend("right", bty="n", cex=0.6, title="Treatment", c("dBet6_1h", ...", "shPLKO"), fill=c("lavendar", "...", "marineblue"))

    #1-cor distance
    heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", ColSideColors=MySampleColorVector, density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.3, labCol=NA, cexCol=0.6, distfun=function(x) as.dist(1-cor(t(x))), hclustfun=function(x) hclust(x, method="ward.D2"))
    legend("right", bty="n", cex=0.6, title="Treatment", c("dBet6_1h", ...", "shPLKO"), fill=c("lavendar", "...", "marineblue"))
dev.off()
ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe11k

I will certainly give this a try Kevin, and thank you, but first I need some clarification on where to get MySignificantGenes from. Is this found in the gene_exp.diff file output from CuffDiff? If so do I just filter by the column significant?

I guess the part that confuses me is that there are a lot of comparisons comparisons going on (e.g. DMSO_1h vs. DMSO_6h, DMSO_1h vs. JQ1_1h, DMSO_1h vs JQ1_6h, DMSO_1h vs. JQMT_1h, DMSO_1h vs. JQMT_6h, DMSO_1h vs. MTX_1h, DMSO_6h vs. JQ1_1h, shCTR vs. shBRD4, shBRD4 vs. shMTHFD1, etc.).

My concern is that since CuffDiff can only do pairwise comparisons perhaps this is somehow not meaningful to look at?

Assuming I do use the gene_exp.diff file (6Gb) this is the following code I've run before I get some weird result:

# Since file is too large for R I do some pre-processing first in linux
cut -f 3,14 gene_exp.diff > edit-degs.diff  # Cut the *gene* and *significant* columns
awk '$2 ~ /yes/ {print $0}' edit-degs.diff > degs.csv  # Filter by significant genes
cut -f 1 degs.csv > degs.csv   # Cut out just the gene names
sed -i 's/-/NA/g' degs.csv     # Turn missing gene names to NA

# Now process *MySignificantGenes* in R
sig <- read.csv("degs.csv") # Read in the dataset
sig <- na.omit(sig)  # Omit NAs
sig <- gsub(",.*", "", sig[,1])   # Since some genes have multiple names (e.g. C16orf45,MPV17L,RP11NA1021N1.1) remove the *,* and everything after it
sig <- as.data.frame(sig) # Since the last command turned it into *character* convert back to data.frame
sig <- unique(sig) # There are many duplicate key values (gene names) so filter them out
nrow(sig)
[1] 20244

So when I get to the heat <- t(scale(t(MyUnLoggedFPKMValues))) my nrow(heat) is [1] 55939. However, I'm getting weird results here:

heat2 <- heat[which(rownames(heat) %in% sig), ]
nrow(heat2)
[1] 0
head(heat2)
     K562_DMSO_1h_FPKM K562_DMSO_6h_FPKM K562_JQ1_1h_FPKM K562_JQ1_6h_FPKM K562_JQMT_1h_FPKM K562_JQMT_6h_FPKM K562_MTX_1h_FPKM
 K562_MTX_6h_FPKM K562_dBet6_1h_FPKM K562_dBet6_6h_FPKM K562_shBRD4_FPKM K562_shMTHBRD_FPKM K562_shMTHFD1_FPKM
 K562_shPLKO_FPKM

I compared gene names from sig and from heat to confirm that the genes are present from sig in heat. Not sure why I'm getting this error?

ADD REPLYlink modified 3 months ago • written 3 months ago by moldach50

Yes, but your experimental set-up is complex and it would be extremely difficult to extract just a single P value for each gene. That is, you have many conditions and time points and would require a more complex form of modelling, couple with a very large sample size. Cuffdiff is mainly for comparisons between just two conditions.

In your situation, and I've done this before, take the top 10 or 20 genes from each comparison, merge thees into a vector and remove any duplicate genes that may have been significant in more than one comparison, and then use this list of significant genes to subset your FPKM matrix and to generate your heatmap (with FPKM vaues that you'll scale yourself using my code above). You should then see good separation. I would not drift away from the pairwise comparisons that have already been done - it may very well be the best way to analyse the data that you've got.

PS - if not implied in the CuffDiff files, it would be beneficial to find out if the P values have been adjusted by FDR!

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe11k

Yes, but your experimental set-up is complex and it would be extremely difficult to extract just a single P value for each gene. That is, you have many conditions and time points and would require a more complex form of modelling, couple with a very large sample size. Cuffdiff is mainly for comparisons between just two conditions.

Yes this is pretty much what I was thinking :(

Please note that I modified my comment above sometime ~during your answer to show what I've worked through so far on the code.

take the top 10 or 20 genes from each comparison, merge thees into a vector and remove any duplicate genes that may have been significant in more than one comparison, and then use this list of significant genes to subset your FPKM matrix and to generate your heatmap

Your suggestion seems reasonable but a bit tricky for me to implement - I'll give it a try though. Most of the data.frame manipulation I've done to-date has been on small files in R (using dplyr for example) so I'm having a bit of difficulty trying to figure out how to do this in linux.

There are 5 chemical treatments (each with two time periods) and 4 shRNA treatments. So I did the pairwise comparison calculations for how many I need to do for each of the cell lines. PCs=[k(k-1)]/2. So [1413]/2=91

Let me think this out through pseudo-code now: 1) cut -f 3,5,6,10,14 gene_exp.diff # filter out gene, sample1, sample2, log2(fold-change), and significant columns

2) awk '$5 ~ /yes/ {print $0}' edit-degs.diff > degs.csv # Filter out only those genes which were significant in any pairwise comparison

3) Create 51 smaller datasets (one for each pairwise comparison) - I'm sure there has to be a more elegant way to do this?

awk `$2 ~ /DMSO_1h/ && $3 ~/MTX_6h/ {print $0}' degs.csv > pc1.csv # Use multiple column matching with awk for each of the comparisons

4) Sort each of these datasets by log2(fold-change) (or FDR p-value) and grab the top 20.

5) Merge these into a vector v1 <- merge(pc1, pc2) than v2 <- merge(v1, pc3), etc. 51 times or use plyr v1 <- join_all(pc1, pc2, ...., pc51)

Oh yeah and about the pairwise comparisons that have already been done well that's just the thing...I don't know/think they've been done yet. I certainly can't access them from the .html and can't see them in SCRATCH folder (just the cuffdiff_global folder). I guess that's a question I'll have to ask my sequencing facility.

ADD REPLYlink modified 3 months ago • written 3 months ago by moldach50

Steps 1 and 2 look reasonable, but from those steps all you need is a list of significant genes. awk is fine for filtering but you should filter on both log2 fold change and the P value. For example: awk '$2>2.0 || $2<(2.0*-1)' CuffDiff.out | awk '$3<0.05' (assuming columns $2 and $3 are log2 fold change and P value, respectively). From the output of these, just use cut to grab the significant gene names into a single-column file. Once you have significant genes from each comparison, use cat to piece them together, uniq to remove duplicate genes, and then read this into R where you'll subset the FPKM matrix.

Sorry that you're facing these issues but the core facility should have logs of how each analysis was conducted. As a core service, they need to be held accountable for all data that they return to customers/collaborators.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe11k

the sequencing facility has given me cuffdiff results so that is what i have to work with

If you already have the Cuffdiff results, then you already have differentially expressed genes.

ADD REPLYlink written 3 months ago by igor5.0k

I believe that he wanted to do the differential expression himself in order to see see if he found the same results using limma as were found with CuffDiff

ADD REPLYlink written 3 months ago by Kevin Blighe11k

To clarify things I used FPKM values from CuffLinks output not CuffDiff. Here is an image from the sequencing facility: http://i67.tinypic.com/qpj86u.jpg Specifically I used Genes (Symbols) which is the Gene FPKM file just with ensembl_gene_names added.

The CuffDiff results seen here http://i65.tinypic.com/2ic0jle.png are from 816 pair-wise comparisons. The first week I started here none of these links were clickable and I didn't have access to the cluster to look at files in SCRATCH. Now that I have cluster access I can only see a cuffdiff_global folder http://i67.tinypic.com/ipun7s.png but no individual pairwise comparison results. Should I be using the genes.fpkm_tracking file instead?

ADD REPLYlink modified 3 months ago • written 3 months ago by moldach50

You i would have used the data from genes.fpkm_tracking (or genes.fpkm_table if you have it, which is the merge of all of your samples), subsetted this with the significant genes from CuffDiff, and then performed clustering/heatmap generation.

For further analyses, also start with genes.fpkm_tracking/table, log it, check the distribution with hist() to ensure binomial distribution, and then run limma().

ADD REPLYlink written 3 months ago by Kevin Blighe11k
3
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe11k
London/Brazil
Kevin Blighe11k wrote:

Dear Ashish,

For your situation, I would not use MeV.

I recommend reading your data into R, log base 2 transform them, and then run Limma in order to conduct differential expression.

Here is a recent thread I did on Limma: differentially expressed genes

You may take a look at this thread where I give some information on producing heatmaps in R: How to cluster the upregulated and downregulated genes in heatmap?

Also, here is a more advanced version of a heatmap in R: How to plot a heatmap with two different distance matrices for X and Y

Kevin

ADD COMMENTlink modified 3 months ago • written 3 months ago by Kevin Blighe11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1716 users visited in the last hour