Question: Heatmap based with FPKM values
0
gravatar for Mehmet
17 months ago by
Mehmet460
Japan
Mehmet460 wrote:

Dear All,

I have a data set of 17 samples with FPKM values. In the data set I have ~800 differentially expressed genes some of which have zero FPKM value.

I would like to ask you about generating heatmap.

  1. Can I use FPKM values directly for generating heatmap?

  2. Do I have to perform additional analysis with FPKM values prior to generating heatmap?

  3. According to your experience, which genes (based on FPKM values) should I use in heatmap analysis?

Thank you

rna-seq next-gen R • 3.6k views
ADD COMMENTlink modified 17 months ago by Kevin Blighe41k • written 17 months ago by Mehmet460
1

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

ADD REPLYlink modified 10 weeks ago • written 8 months ago by Kevin Blighe41k
3
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe41k
London, England
Kevin Blighe41k wrote:

Yes, you can begin with FPKM values but you will have to transform these values and also filter the dataset with your 800 differentially expressed genes. On that note, 800 is a lot of genes. Try to increase your cut-offs for statistically significantly differentially expressed. Try things like:

  • FDR Q<0.05 and absolute log2 fold-change>2
  • FDR Q<0.01 and absolute log2 fold-change>4
  • FDR Q<0.01 and absolute log2 fold-change>2
  • FDR Q<0.001 and absolute log2 fold-change>2
  • FDR Q<0.001 and absolute log2 fold-change>4
  • FDR Q<0.0001 and absolute log2 fold-change>2

et cetera.

Use this code (below).

  • Your FPKM values will be stored in MyFPKMValues
  • DiffExpressedGenes will comprise a single vector of genes that are differentially expressed
  • zFPKM package will be used to convert your FPKM values to the Z-scale prior to clustering.

---------------------------

Set colour and heatmap scaling breaks

require("RColorBrewer")
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-3, 3, length.out=101)

Scale the FPKM values to the Z scale

library(zFPKM)
heat <- zFPKM(MyFPKMValues)

Filter your dataset to include your differentially expressed genes:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes), ]

Generate heatmaps with Euclidean distance (first) and '1 - Pearson correlation' distance (second) (both use Ward's linkage)

require("gplots")

#Euclidean distance
heatmap.2(heat,
  col=myCol,
  breaks=myBreaks,
  main="Title",
  key=T, keysize=1.0,
  scale="none",
  density.info="none",
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  trace="none",
  cexRow=0.2,
  cexCol=0.8,
  distfun=function(x) dist(x, method="euclidean"),
  hclustfun=function(x) hclust(x, method="ward.D2"))

#1-cor distance
heatmap.2(heat,
  col=myCol,
  breaks=myBreaks,
  main="Title",
  key=T, keysize=1.0,
  scale="none",
  density.info="none",
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  trace="none",
  cexRow=0.2,
  cexCol=0.8,
  distfun=function(x) as.dist(1-cor(t(x))),
  hclustfun=function(x) hclust(x, method="ward.D2"))

To each heatmap command, you can add ColSideColors, which is a vector of colours for a condition of interest, such as case/controls. The order of this colour vector has to match the order of samples in your 'heat' object that you pass to heatmap.2

heatmap


Note that converting to the Z scale is not exclusive: These guys median-centered their FPKM data and then log (base 2) transformed them prior to heatmap generation

ADD COMMENTlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

Hi Kevin,

In this code:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes), ]

what is the DiffExpressedGenes) ?

ADD REPLYlink written 17 months ago by Mehmet460

In my FPKM data,

some genes have 0 (zero) FPKM values in some samples. Is it okay to remove those genes?

ADD REPLYlink written 17 months ago by Mehmet460
1

DiffExpressedGenes is a vector that contains the names of your differentially expressed genes. These names should all be part of the rownames of your FPKM matrix.

I believe you can include FPKM values that are 0. Why? - due to the fact that we scale / transform the data to the Z-scale, these values will return as -4 or -6 (or something else that is negative). You cannot have any NAs in your data, but zeros are okay.

ADD REPLYlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

Hi Kevin, Thank you.

I created GeneIDs file, and I used heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.2, cexCol=0.8, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))

but I received this error:

Error in heatmap.2(heat, col = myCol, breaks = myBreaks, main = "Title",  :

`x' must have at least 2 rows and 2 columns

this is my heat file:

Stage1 Stage2 Stage3  Stage17
ADD REPLYlink written 17 months ago by Mehmet460
1

It should contain genes as rows and samples as columns. The values should be numerical.

Most likely, your filtering of differentially expressed genes has been incorrect.

What are the contents of DiffExpressedGenes and how did you obtain these?

ADD REPLYlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

I printed the first column that has gene names using awk command.

the content of DiffExpressedGenes:

      V1
1 Gene_1303900.1
2 Gene_1038800.1
3 Gene_1579700.1
4 Gene_1031800.1
5 Gene_1031700.1
6 Gene_1579600.1
ADD REPLYlink modified 17 months ago • written 17 months ago by Mehmet460

Hi Kevin,

Do you mean that the DiffExpressedGenes must have only gene names and the MyFPKMValues must have both gene names in each sample with FPKM values? If yes, my files are so.

ADD REPLYlink written 17 months ago by Mehmet460

Yes, that should be the situation. I am worried about the '.1' at the end of your gene names in DiffExpressedGenes, though. Are these names exactly as they appear in the rownames for MyFPKMValues?

ADD REPLYlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

Yes, they are. I picked gene names from the MyFPKMValues.

ADD REPLYlink written 17 months ago by Mehmet460
1

Wait, DiffExpressedGenes looks like a data-frame.

Try this instead for the filtering step:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes[,1]), ]
ADD REPLYlink written 17 months ago by Kevin Blighe41k

Now OK. but I received this error:

Error in hclust(x, method = "ward.D2") :

NA/NaN/Inf in foreign function call (arg 11)

Zero values have caused this error?How to deal with this error?

ADD REPLYlink written 17 months ago by Mehmet460

Hi Kevin,

OK. I omitted NaN values using: heat <- na.omit(heat)

Then I generated heatmap.

I would like to have your advice. In my FPKM data I have ~800 genes. These genes are effector genes (composed of three different gene families) and have FPKM values in each stage. To decrease number of genes to make heatmap more visible, what should I do? You mentioned to use FDR or log2 values, but I just have FPKM values. Is it okay to remove some genes that are under a threshold like 100 FPKM?

by the way, Thank you very much for your help at both PCA and heatmap post.

ADD REPLYlink written 17 months ago by Mehmet460
1

No problem, Mehmet.

From where did you obtain these FPKM values? Do you want to conduct differential expression analysis on them?

You could remove low FPKM genes - sure. You could also remove genes that have low variance (probably better).

Using something like this, you can identify which genes fall in the bottom half of the variance range:

geneVariances <- apply(MyFPKMValues, 1, var), na.rm=TRUE)

maxVariance <- max(geneVariances)

minVariance <- min(geneVariances)

cutoffVariance <- maxVariance - ((maxVariance - minVariance) / 2)

which(geneVariances<cutoffVariance)

The final line of code (which(geneVariances<cutoffVariance)) should return the indices of the genes that fall in the lower half of variance in your dataset.

ADD REPLYlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

I generated from cuffnorm tool.

According to your advice I will remove genes that are produced from (which(geneVariances<cutoffVariance))

ADD REPLYlink written 17 months ago by Mehmet460
2

You can try CuffDiff to obtain a P value (if you even have samples that you want to compare via differential expression?).

The which() command will just return indices of genes, which you an use to filter.

Also be aware that Cufflinks is not supported anymore. You should aim to use HISAT2/StringTie the next time.

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Hi Kevin,,

I want to ask one more thing. How to show sample names in heatmap based on orders in FPKM data?

when I draw heatmap, sample order has changed. How can I keep the same order with the original file?

ADD REPLYlink written 16 months ago by Mehmet460
1

Hi Mehmet, wait, is this for ComplexHeatmap (from our other thread: how to cluster genes in heatmap ) or heatmap.2 (gplots)?

For ComplexHeatmap, just supply row_order or column_order as extra parameters to the Heatmap() function. These just have to be vectors of rownames or column names (that match your data matrix). ComplexHeatmap will then fix the order based on these.

For heatmap.2, you can maintain order as per your data-matrix by configuring RowV and ColV, e.g., RowV=FALSE, ColV=TRUE, will cluster your columns but the rows (genes) will be ordered as they appear in your supplied data-matrix.

ADD REPLYlink written 16 months ago by Kevin Blighe41k

Hi Kevin,

For complexHeatmap:

I tried but I could not do.

What I did is: 1. I generate a table that has sample names

sample1
sample2
sample3

2. saved the table as "samplenames" 3. I added row_names = "samplenames" in heatmap function.

The order on the right side of the heatmap is the correct order based on my data.matrix.

What I want is to make the same order in the column order below the heatmap as it is in my data.matrix

ADD REPLYlink written 16 months ago by Mehmet460
1

Hi mate, no, samplenames should just be a vector. You can create it like this:

samplenames <- c("Sample1", "Sample2", ..., "Sample X")

or

samplenames <- colnames(ExpressionMatrix)

samplenames should not be a data-frame, even if it's a single column data-frame.

ADD REPLYlink modified 10 weeks ago • written 16 months ago by Kevin Blighe41k

Hi Kevin,

I did but column names below the heatmap did not ordered as they are in the data.matrix. The annotation side is ordered as they are in the data.matrix.

What I did is:

samplenames <- c("Sample1", "Sample2", ..., "Sample X")

and

column_order = samplesnames
ADD REPLYlink written 16 months ago by Mehmet460
1

Oh, sometimes, because there are so any parameters in ComplexHeatmap, one parameter 'over-rides' another. Try also:

cluster_columns=FALSE and cluster_rows=FALSE

ADD REPLYlink written 16 months ago by Kevin Blighe41k

Hi Kevin,

I did, but it did not worked.

I run this:

hmap <- Heatmap(heat, name="Z-score",col=colorRamp2(myBreaks, myCol), column_order = samplesnames, cluster_columns = FALSE, heatmap_legend_param=list(color_bar="continuous", legend_direction="horizontal", legend_width=unit(3,"cm"), title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")), split=df$EffectorName, cluster_rows=TRUE, show_row_dend=TRUE, row_title="Transcript", row_title_side="left", row_title_gp=gpar(fontsize=5, fontface="bold"), show_row_names=TRUE, row_names_side="left", row_title_rot=0, cluster_columns=TRUE, show_column_dend=TRUE, column_title="Samples", column_title_side="top", column_title_gp=gpar(fontsize=10, fontface="bold"), column_title_rot=0, show_column_names=TRUE, clustering_distance_columns=function(x) as.dist(1-cor(t(x))), clustering_method_columns="ward.D2", column_dend_height=unit(5,"mm"), clustering_distance_rows="euclidean", clustering_method_rows="ward.D2", row_dend_width=unit(5,"mm"), top_annotation_height=unit(0.5,"cm"), top_annotation=ColAnn, bottom_annotation_height=unit(1, "cm"), bottom_annotation=boxAnnCol)

this is the error:

    Error in Heatmap(heat, name = "Z-score", col = colorRamp2(myBreaks, myCol),  : 
  formal argument "cluster_columns" matched by multiple actual arguments
ADD REPLYlink written 16 months ago by Mehmet460
1

Oh, you have cluster_columns mentioned twice. The first mention of it, it's TRUE, and the second it's FALSE.

Take a closer look at your code for the Heatmap function. You'll see it there twice.

ADD REPLYlink written 16 months ago by Kevin Blighe41k

Hi Kevin,

I know I am asking too much help from you. I really appreciate for your help.

I wrote cluster_columns = TRUE but not changed order of samples. I also tried cluster_columns = FALSE but I received this error:

Error in if (sum(l1)) grid.points(x = rep(i, sum(l1)), y = x[, i][l1], : argument is not interpretable as logical Error when drawing annotation 'boxplot' Error in .local(object, ...) : Error in if (sum(l1)) grid.points(x = rep(i, sum(l1)), y = x[, i][l1], : argument is not interpretable as logical

It produced a graph with same order in matrix.data, except annotations, but a sample was NA.

ADD REPLYlink written 16 months ago by Mehmet460

Can you paste a sample of your data?

ADD REPLYlink written 16 months ago by Kevin Blighe41k
FamilyName GeneID Egg_Ir12 Egg_Ir13 L2_Ir14 L2_Ir15 L3_Ir200 L3_Ir201 L4_Ir203 L4_Ir202 FA_Ir205 FA_Ir204 MA_Ir206 MA_Ir207 D3_Ir1279 D3_Ir1278 D4_Im1215 D4_Im1210 D4_Im1214 
GH16 gene_1303900.1 25.7847 19.7103 22.6148 26.5067 25.0661 23.3288 25.0834 24.1146 111.904 116.042 29.1163 27.7907 22.3014 21.3545 66.5493 64.4022 53.0803
GH16 gene_1038800.1 2.34831 2.95942 6.84916 7.94771 29.7827 28.9562 53.2878 48.1816 84.8273 76.8617 68.5481 57.4876 41.1048 37.7387 4.80982 0 0
GH16 gene_1579700.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4.72954 0 0
GH16 gene_1031800.1 0.584824 0 0 0 0.726056 1.4398 5.05124 2.66249 0.576734 1.89816 0.894391 2.67472 0 0 0

========================================================================

ADD REPLYlink modified 16 months ago by Kevin Blighe41k • written 16 months ago by Mehmet460

I am trying to make the data understandable, but I do not know how to do.

ADD REPLYlink written 16 months ago by Mehmet460

GH16 is FamiliyName. Egg, L2, L3 ... sample name

ADD REPLYlink written 16 months ago by Mehmet460

Ensure that there are no NAs in your data. Also, your gene 'gene_1579700.1' has virtually no data...

You may want to post a new question if you still have problems.

For clustering, a data-matrix with no NA values must be used. Also, for the distance metrics to function properly, too many zero values will cause problems.

ADD REPLYlink modified 10 weeks ago • written 16 months ago by Kevin Blighe41k

the zFPKM link (https://bioconductor.org/packages/release/bioc/html/zFPKM.html) is down.

ADD REPLYlink written 11 months ago by Zhilong Jia1.4k

Hey, it seems to be okay for me. Are you getting the 404 error?

ADD REPLYlink written 11 months ago by Kevin Blighe41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1140 users visited in the last hour