Question: RNA-seq replicates not clustering
gravatar for sckinta
6 months ago by
United States
sckinta480 wrote:

I know someone posted similar topics before (rna seq replicates not clustering ). However, here, I want to prompt some discussions on this topic with given plots and examples here.

I have triplicates for three conditions (Ev. Mut and WT) in mouse cells. I followed standard ways people usually used to analyze RANseq data: alignment (STAR) -> raw count (HTseq with intersection-strict mode) -> TMM normalization on edgeR -> VSN to 'disassociate dependence of the variance on the mean intensity' -> use VSN value for exploratory analysis (QC).

I did both PCA and pairwise-correlation clustering to find 'outlier'.

  • PCA plot:

enter image description here

enter image description here

  • pairwise-correlation heatmap:

enter image description here

From PCA plot, I can generally tell one sample from each condition (wt_rep2, ev_rep1 and mut_rep1) are obviously outliers. pairwise-correlation confirms that outlier. However, the rest of samples are not obviously clustered based on conditions. Yes, those data are totally messed up.

Now here are three options I plan to proceed (with questions):

  1. delete outliers and use rest samples for DE analysis. Since I have only 3 replicates, how could I just two samples for DE analysis?
  2. treat outlier and normal clustering as batch effect, incorperate batch effect as covariant in design.matrix (~ condition + batch), even though I have no idea what possible condition-independent factors cause them as outliers. Is this valid?
  3. Ignore QC, go ahead with DE analysis. Then what is the point to do a QC exploratory analysis?
clustering rna-seq • 636 views
ADD COMMENTlink modified 4 months ago by Biostar ♦♦ 20 • written 6 months ago by sckinta480

Just to help others, is it likely that there is a batch effect for those samples to the left of the PCA bi-plot? How does your PCA bi-plot look if you just use 'my' code: A: PCA plot from read count matrix from RNA-Seq ?

ADD REPLYlink modified 4 months ago • written 6 months ago by Kevin Blighe37k

PC1-5 scatterplots: PC1-3 bi-plots:

ADD REPLYlink written 6 months ago by sckinta480

I would not go by CPM. I would analyse the samples in DESeq2 and produce regularised log counts. Prior to normalisation, I would remove transcripts whose mean raw count was <10 or <15. Then, regenerate the plots.

If you proceed to DEA with the current samples, then I imagine that many transcripts would fail Cook's test for outliers based on having large variance. Your PC1 variance is >40%, which is quite substantial.

ADD REPLYlink written 6 months ago by Kevin Blighe37k

This is not CPM. It is vsn output. It is like rlog in DESeq2

ADD REPLYlink modified 6 months ago • written 6 months ago by sckinta480

Why is cpm in your PCA bi-plot titles? VSN is not like rlog in DESeq2

ADD REPLYlink modified 4 months ago • written 6 months ago by Kevin Blighe37k

It is just a name. I called vsn as vsnCPM. The plots were made from vsn output. VSN is variance stabilizing method, which is what rlog was trying to do. but rlog is less sensitive to sample size

ADD REPLYlink written 6 months ago by sckinta480

I did remove low exp genes before normalization. Actually I tried to PCA top 50% genes, the results are very similar. I have very low depth on samples. After mapping and filtering, I have only 2M ~ 9M (3M on average) reads left per sample. When I was doing the low exp genes, I just made sure top 3 samples have at least 1 mapped read per sample. This filtered out 40% genes I called. Do you think it would change if I increase my expression filtering cut off?

ADD REPLYlink written 6 months ago by sckinta480

You have small sample size and shallow sequencing, it is no surprise there is a lot of noise. Does the "outliers" correlate with sequencing depth?

ADD REPLYlink written 6 months ago by h.mon23k

Well, only three samples, I cannot say correlation. I do observe outliers have relatively more read count. Here are read count per sample

Th2KO_ev_rep1   7,616,704
Th2KO_ev_rep2   3,455,495
Th2KO_ev_rep3   2,867,725
Th2KO_mut_rep1  9,538,351
Th2KO_mut_rep2  5,271,933
Th2KO_mut_rep3  2,498,574
Th2KO_wt_rep1   5,305,279
Th2KO_wt_rep2   5,034,473
Th2KO_wt_rep3   2,847,010
ADD REPLYlink modified 6 months ago • written 6 months ago by sckinta480

What is the % variation on your PC1?

Sequencing depth does not appear to be the sole reason, as Th2KO_wt_rep1 also has high depth but groups fine with the other samples.

If you have still not sorted this out, can you show your code and also state clearly the source of the samples and any errors that occurred during sample processing in the wet-lab. You never clearly stated this, from what I can see. One cannot introduce a batch variable for no logical reason, of course.

ADD REPLYlink modified 4 months ago • written 4 months ago by Kevin Blighe37k
gravatar for h.mon
6 months ago by
h.mon23k wrote:

With just three replicates per treatment, I would be very cautious of calling outliers, one needs larger sample sizes to confidently identify outliers. Thus, I wouldn't do 1. Also, I don't see a clear batch effect (clustering is not related to replicate number, for example), and introducing a "batch" effect just because the samples didn't cluster the way you expect seems like torturing the data - thus I wouldn't do 2 also.

ADD COMMENTlink written 6 months ago by h.mon23k

So you would just go head with DE analysis with all three replicates?

ADD REPLYlink written 6 months ago by sckinta480
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2444 users visited in the last hour