Question

RNA-Seq Data Quality Assessment - Heatmap and PCA Interpretation

3.7 years ago

Aynur ▴ 60

I am following a STAR-HTSeq-DESeq2 pipeline for my mouse RNA-Seq data analysis, and I am concerned about the heatmap and PCA results. I expected sample b not to cluster with the control group, but it does. Am I missing something here? a, b, c, and d are different treatment conditions, and each one has two biological replicates. Here are my plots.

[Figure: Heatmap for samples]

[Figure: PCA plot for samples]

Should I be concerned about sample b? How should I interpret these plots? Any advice or article recommendations are appreciated. Thank you very much.

sequence rna-seq R next-gen • 4.7k views

updated 3.7 years ago by antonioggsousa 3.2k • written 3.7 years ago by Aynur ▴ 60

score 3 · Accepted Answer · 2020-08-22

Hi,

In my opinion, this means that, among your treatments, treatment b is the most similar to the control condition, to the point of being virtually the same. Assuming that you normalized your data before these analyses, using a vst or rlog transformation, the gene expression profiles of treatment b and the control are virtually the same. So your treatment b appears to have no effect on the overall gene expression profile.

It may have an effect, but one so small that it is difficult to quantify relative to the control (perhaps a higher number of replicates would help), or the difference between treatment b and the control is confined to a small number of genes, so these techniques do not detect it.

Regarding PCA, you might want to read this post. Essentially, PCA tries to capture the variability in your gene expression profiles. The plot shows only the first two principal components (PCs), which explain most of the variability, in your case 90%. If two samples (points) are close, they have similar gene expression profiles. Be careful to read the plot along the x-axis and the y-axis separately, since each one explains a different source of variability in your data.
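To make the PCA reading concrete, here is a minimal Python sketch on simulated data (not your data, and not what DESeq2 runs internally). Everything in it is an assumption for illustration: the sample labels, the 500 genes, the noise level, and a treatment "a" that shifts 100 genes while "b" shifts none. It computes the fraction of variance explained by PC1 and PC2 and the per-sample scores, using plain power iteration instead of a linear algebra library.

```python
import random

def top_eigen(matrix, iters=500):
    """Leading eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    n = len(matrix)
    v = [1.0 / (i + 1) for i in range(n)]  # non-constant start vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, w)), v

random.seed(1)
n_genes = 500
labels = ["ctrl_1", "ctrl_2", "b_1", "b_2", "a_1", "a_2"]  # hypothetical samples
baseline = [random.uniform(2, 14) for _ in range(n_genes)]  # log-scale expression

def simulate(effect):
    """One sample: baseline + noise, with `effect` added to the first 100 genes."""
    return [b + (effect if g < 100 else 0.0) + random.gauss(0, 0.3)
            for g, b in enumerate(baseline)]

# Assumption: treatment b has no effect, treatment a shifts 100 genes by 3 units.
samples = [simulate(0.0), simulate(0.0), simulate(0.0), simulate(0.0),
           simulate(3.0), simulate(3.0)]

# Center each gene across samples, then build the sample-by-sample Gram matrix.
n = len(samples)
gene_means = [sum(s[g] for s in samples) / n for g in range(n_genes)]
centered = [[s[g] - gene_means[g] for g in range(n_genes)] for s in samples]
gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in centered] for ci in centered]
total = sum(gram[i][i] for i in range(n))  # total variance (up to a constant)

lam1, v1 = top_eigen(gram)
deflated = [[gram[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
lam2, v2 = top_eigen(deflated)

pc1_frac, pc2_frac = lam1 / total, lam2 / total
pc1_scores = [lam1 ** 0.5 * x for x in v1]  # sample coordinates on PC1
pc2_scores = [lam2 ** 0.5 * x for x in v2]  # sample coordinates on PC2

print(f"PC1 explains {pc1_frac:.1%}, PC2 explains {pc2_frac:.1%}")
for label, s1, s2 in zip(labels, pc1_scores, pc2_scores):
    print(f"{label:7s} PC1={s1:8.2f} PC2={s2:8.2f}")
```

In this simulation the "b" samples sit next to the controls on PC1 while the "a" samples are far away, which is the same pattern as a treatment with little global effect. A gap along an axis that explains most of the variance matters much more than a gap of the same size along an axis that explains only a few percent.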

As for the heatmap, it depends on the distance you use. It seems that you've used a correlation metric, i.e., Pearson or Spearman, so it represents the correlation between gene expression profiles. If two samples cluster closer together, they are more correlated with each other than with the others. You can look at the color key to see whether the correlation is high or low. If it is close to one, the samples are virtually the same. However, when a large number of genes is compared, it is not difficult to get high correlation scores by chance.
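As a rough illustration of why sample-to-sample correlations tend to be high, and why a treatment that changes only a few genes is hard to see this way, here is a small Python sketch on simulated values. The gene count, noise level, and the 20 genes with a 4-fold change are all made-up assumptions, not estimates from your experiment.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
n_genes = 2000
n_changed = 20  # hypothetical: only 1% of genes respond to the treatment

# Simulated log2 expression: a gene-specific baseline plus small noise.
baseline = [random.uniform(0, 15) for _ in range(n_genes)]
control = [b + random.gauss(0, 0.3) for b in baseline]
treated = [b + random.gauss(0, 0.3) for b in baseline]

# Strong 4-fold (2 log2 units) up-regulation, but in the first 20 genes only.
for i in range(n_changed):
    treated[i] += 2.0

r = pearson(control, treated)
print(f"Pearson r between control and treated: {r:.4f}")
```

The correlation stays very close to one because the wide spread of baseline expression across genes dominates it, not the treatment. So on a correlation heatmap, small differences in the color key (for example 0.99 versus 0.97) can be the informative part, and a differential expression test between b and the control is the better tool for finding a handful of changed genes.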

I hope this helps,

António