Question: The interpretation of PCA
2.4 years ago, Za 130 wrote:

Hi,

I have bulk RNA-seq and single-cell RNA-seq data from the same organism at 9 time points (2h, 4h, ..., 16h), each in two replicates. I plotted a PCA of the two datasets (R and T). Based on this picture, my interpretation is as follows: 42% of the variance is due to time, because I see 9 groups of samples (the time points) spread along PC1, and 34% of the variance is due to the difference between bulk and single-cell RNA-seq, because I see two major groups of samples on PC2. Please correct me if I am wrong. So, taking bulk RNA-seq as the gold standard for quality control of the single-cell RNA-seq, we conclude that the single-cell RNA-seq data is not good, because the two groups do not lie on top of each other and are too far apart.

pca deseq2 R • 1.3k views
modified 2.3 years ago by Biostar ♦♦ 20 • written 2.4 years ago by Za 130

That the two groups are not on top of each other is not a sign of good or bad quality, but a sign of batch effect, which is totally expected with your data.

written 2.4 years ago by Benn 8.0k

To follow on from the obvious batch difference between the scRNA-seq and the bulk data: I also do not believe you can say that 42% of the variance is due to time, because it is not only PC1 that separates your samples by time. Indeed, such a statement would be misleading. Time is clearly distributed across both PC1 and PC2 in your PCA bi-plot.

I would not read too much into PCA in terms of your project's conclusions. Just use it to get a 'feel' for your data distribution, mainly in terms of batch effects and outliers. Only 'get your teeth into' PCA (i.e. probe further) if you really understand the method and what it means for the biological interpretation of your numerical data.
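As a rough illustration of the point above (a toy sketch with made-up numbers, not the poster's data): one PC can carry most of the variance while time still correlates with both PC1 and PC2, so the percentage printed on an axis cannot be read as "variance due to time". The design, the gene values and the helper names `pearson` and `pca_2d` are all assumptions for this example, and the PCA is worked out by hand for two features only.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def pca_2d(xs, ys):
    """PCA of two features via the closed-form eigen-decomposition of the
    2x2 covariance matrix. Returns (variance fractions, [PC1, PC2] scores).
    Assumes the two features have non-zero covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx = sum(a * a for a in cx) / (n - 1)
    syy = sum(b * b for b in cy) / (n - 1)
    sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + root, half_trace - root]
    scores = []
    for lam in eigenvalues:
        # (sxy, lam - sxx) is an eigenvector for eigenvalue lam when sxy != 0.
        vx, vy = sxy, lam - sxx
        norm = math.hypot(vx, vy)
        scores.append([(a * vx + b * vy) / norm for a, b in zip(cx, cy)])
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues], scores

# Hypothetical design: 4 time points, each measured in 2 batches.
time = [0, 1, 2, 3, 0, 1, 2, 3]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
# Gene 1 responds to time and batch; gene 2 responds to batch only.
gene1 = [t + 2 * b for t, b in zip(time, batch)]
gene2 = [2 * b for b in batch]

fractions, (pc1, pc2) = pca_2d(gene1, gene2)
r_time_pc1 = pearson(time, pc1)
r_time_pc2 = pearson(time, pc2)

print("variance fractions:", [round(f, 3) for f in fractions])
print("|cor(time, PC1)|:", round(abs(r_time_pc1), 3))
print("|cor(time, PC2)|:", round(abs(r_time_pc2), 3))
```

In this toy case PC1 holds the bulk of the variance, yet time is clearly correlated with both components, which is the situation described for the bi-plot in this thread.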

modified 2.4 years ago • written 2.4 years ago by Kevin Blighe 67k

I am going to trace cell fate decisions using single-cell RNA-seq, so I want to know whether my data is good; that is why I compared it with bulk RNA-seq. How can I show that my single-cell data is of sufficient quality to show cell fate decisions?

This is distribution of read counts in bulk and single cell

modified 2.4 years ago • written 2.4 years ago by Za 130

See: How to add images to a Biostars post

written 2.4 years ago by RamRS 30k

Sorry, when I calculated the Pearson correlation between bulk and single-cell data for each time point, there was a good correlation between them. Can I therefore claim that the single-cell data is good?

modified 2.4 years ago • written 2.4 years ago by Za 130

How did you conduct the correlation, exactly? - on a gene- or sample-wise basis? Did you derive p-values from the Pearson coefficient?

modified 2.4 years ago • written 2.4 years ago by Kevin Blighe 67k

Thank you, this is a picture of the correlation across 11,000 genes between the 0 hour time point in bulk and single cell, with a significant p-value.

I just wonder: can I use correlation to say my data is good, or does the PCA say everything?

modified 2.4 years ago • written 2.4 years ago by Za 130

The correlation looks good but I cannot see exactly how you've done it.

If you want to merge these data together, then you will obviously have to include ExperimentType (scRNA | bulkRNA) as a covariate in all statistical tests that you do.

  • Your PCA bi-plot indicates that there is a large batch effect between 2 sets of your samples (presumably the 2 'strata' in your bi-plot relate to scRNA-seq and bulk RNA-seq samples)
  • The correlation values indicate that, despite batch differences, the 'patterns' and consistency (in terms of what goes up and down) of the expression values in these datasets are good
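To make the covariate advice concrete, here is a minimal sketch with invented numbers. It compares the slope of expression on time when the experiment type is ignored with the slope after adjusting for it. Centering both variables within each group gives the same slope as a linear model with the group as a categorical covariate; the variable names, the data and the helpers `slope` and `center_within` are assumptions for illustration, not anything taken from the poster's analysis.

```python
def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def center_within(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Hypothetical gene: expression rises by 1 unit per hour in both experiments,
# but the bulk samples sit 10 units higher and cover earlier time points.
experiment_type = ["bulkRNA"] * 4 + ["scRNA"] * 4
time_h = [2, 4, 6, 8, 6, 8, 10, 12]
expression = [12, 14, 16, 18, 6, 8, 10, 12]

# Ignoring ExperimentType: the batch offset is mistaken for a time effect.
naive_slope = slope(time_h, expression)

# Adjusting for ExperimentType: compare samples only within their own batch.
adjusted_slope = slope(
    center_within(time_h, experiment_type),
    center_within(expression, experiment_type),
)

print("slope ignoring ExperimentType:", round(naive_slope, 3))
print("slope adjusting for ExperimentType:", round(adjusted_slope, 3))
```

With these numbers the unadjusted slope comes out negative even though the gene increases over time in both experiments, while the adjusted slope recovers the within-experiment trend.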
modified 2.4 years ago • written 2.4 years ago by Kevin Blighe 67k

Thanks a lot. I don't want to merge the two datasets; I just want to verify that my single-cell RNA-seq data is of sufficient quality, which is why I used bulk RNA-seq as a gold standard for this comparison.

written 2.4 years ago by Za 130

It looks fine, on face value, i.e., the scRNA-seq data. Obviously I'm limited by what I can see here.

written 2.4 years ago by Kevin Blighe 67k

Thank you. Based on my understanding, PCA separates distinct groups from each other. Here the 9 single-cell RNA-seq samples (I pooled the cells at each time point to simulate bulk RNA-seq) and the 9 bulk RNA-seq samples are clearly separated, which tells me they are different although correlated. Samples ending in _T are cells pooled into a simulated bulk, and samples ending in _R are real bulk RNA-seq. After normalisation the samples are closer to each other, but something still separates the real bulk RNA-seq from the simulated bulk RNA-seq obtained by pooling cells. I thought that might be read depth, but when I looked at the read distribution of the differentially expressed genes (say, the 2 hour time point in real bulk versus the 2 hour time point in simulated bulk), these genes are not among those with very low read counts, so read depth alone is not the reason for these strata in the PCA. I thought that if single-cell data is a good representation of bulk RNA-seq, the pooled cells should not be separated from the corresponding time point in real bulk RNA-seq, yet they are. I am not sure how to prove that the single-cell data is good enough, or how to explain this separation in the PCA.
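For readers unfamiliar with the pooling step mentioned here, a minimal sketch of building a simulated bulk ("pseudobulk") profile is to sum the counts of all cells from the same time point, gene by gene. The cell records, gene names and the function name `pool_cells` below are hypothetical; the poster's actual pooling procedure is not shown in the thread.

```python
from collections import defaultdict

# Hypothetical per-cell counts: (time point, {gene: read count}).
cells = [
    ("2h", {"geneA": 3, "geneB": 0}),
    ("2h", {"geneA": 1, "geneB": 2}),
    ("4h", {"geneA": 0, "geneB": 5}),
    ("4h", {"geneA": 2, "geneB": 4}),
]

def pool_cells(cells):
    """Sum counts over all cells of each time point, one profile per time point."""
    pooled = defaultdict(lambda: defaultdict(int))
    for time_point, counts in cells:
        for gene, count in counts.items():
            pooled[time_point][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {tp: dict(genes) for tp, genes in pooled.items()}

pseudobulk = pool_cells(cells)
print(pseudobulk)
```

Each pooled profile can then be treated like one bulk sample for the purposes of a side-by-side comparison, keeping in mind that it still carries the biases of the single-cell protocol.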

modified 2.4 years ago • written 2.4 years ago by Za 130

PCA is fundamentally based on covariance, which will be indirectly influenced by differences in read-depths.
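A small sketch of why depth matters for covariance-based methods: two hypothetical samples with identical composition but a tenfold difference in sequencing depth show large per-gene variance on raw counts and none once each sample is scaled by its total. Counts per million is used here only as a simple stand-in for library-size normalisation; it is not DESeq2's own size-factor method, and the numbers and helper names are assumptions.

```python
def cpm(counts):
    """Scale one sample's counts to counts per million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]

def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical samples: same composition (10% / 30% / 60%), 10x different depth.
shallow = [10, 30, 60]
deep = [100, 300, 600]

# Per-gene variance across the two samples, before and after depth scaling.
raw_var = [variance(pair) for pair in zip(shallow, deep)]
cpm_var = [variance(pair) for pair in zip(cpm(shallow), cpm(deep))]

print("variance on raw counts:", raw_var)
print("variance after CPM scaling:", cpm_var)
```

On raw counts the depth difference alone would dominate a PCA of these two samples; after scaling there is nothing left to separate them.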

Your best bet is to analyse your bulk and single-cell data separately, and to treat them as entirely independent studies. Different information can be extracted from single cell data that cannot be readily extracted from bulk.

written 2.4 years ago by Kevin Blighe 67k

Excuse me, I just noticed that when I simply normalise my matrices with DESeq2, the correlation at each time point between bulk RNA-seq and simulated bulk RNA-seq (from pooling cells at each time point) is about 40%, but when I take log2 of the data after normalisation, the correlation rises to 80%. Should I say my correlation is 80% or 40%? If it is 40%, that is very frustrating, but after log transformation it is promising. I don't know whether it is acceptable to report the correlation after log transformation. Am I cheating by reporting it?
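A toy sketch of how such a gap can arise (invented counts, not the poster's data): Pearson correlation on the raw scale is dominated by the few most highly expressed genes, so a disagreement between two of them drags the coefficient down, while on the log2 scale every gene contributes more evenly. A rank-based (Spearman) correlation is included because it does not change under a log transform. The pseudocount of 1 and the helper names are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties, as in the toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Hypothetical normalised counts for 8 genes. The six lowly expressed genes
# agree perfectly; the two highly expressed genes swap their order.
bulk = [1, 3, 7, 15, 31, 63, 2047, 8191]
pooled = [1, 3, 7, 15, 31, 63, 8191, 2047]

log_bulk = [math.log2(x + 1) for x in bulk]
log_pooled = [math.log2(x + 1) for x in pooled]

raw_r = pearson(bulk, pooled)            # dominated by the two largest genes
log_r = pearson(log_bulk, log_pooled)    # all genes weigh in more evenly
rank_r = pearson(ranks(bulk), ranks(pooled))  # Spearman, same on either scale

print("Pearson, raw counts:", round(raw_r, 3))
print("Pearson, log2(count + 1):", round(log_r, 3))
print("Spearman (rank-based):", round(rank_r, 3))
```

Neither number is wrong in itself; they answer different questions, so whichever is reported should be stated together with the scale it was computed on.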

written 2.4 years ago by Za 130

How about if you do the regularised log transformation, instead of just log2?

written 2.4 years ago by Kevin Blighe 67k

Thank you, I have not tried that yet, but in general, do you think reporting the correlation after log2 is cheating?

written 2.4 years ago by Za 130

'Cheating' is a strong word. What would be incorrect would be to 'mislead' your audience by making an incorrect inference from the correlation values that you're getting. For example, I would simply not give much importance to these correlation values and would therefore not report them.

As I mentioned, I would treat the bulk- and single cell RNA-seq data as entirely independent experiments and not try to 'merge' them together. They are fundamentally different.

written 2.4 years ago by Kevin Blighe 67k

Sorry, do you think correcting the batch effect between these two sample preparation methods (single-cell and bulk RNA-seq) could make them more similar?

written 2.4 years ago by Za 130

Bulk and single-cell RNA-seq are very different and cannot be analysed together. There are no valid batch correction methods that can fix that.

written 2.4 years ago by WouterDeCoster 44k

Sorry, within my single-cell RNA-seq data I have one time point sequenced on the Fluidigm C1 and 8 time points on the iCELL8. I noticed a big difference in read counts between time points, as the Fluidigm C1 gave me more reads than the iCELL8. I think batch effect correction would come in handy here. But do you know why the Fluidigm C1 gives more read counts than the iCELL8? I have been googling but found nothing clear.

written 2.4 years ago by Za 130

Well, they are different technologies, so they are undoubtedly measuring expression on different scales. You should definitely assume that there will be a batch effect between these data types, and correct for it appropriately if you are planning to merge them together.

May I ask how on Earth an experimental design was devised that includes data from 2 different platforms?

written 2.4 years ago by Kevin Blighe 67k

Za, this is the third time you're adding an image improperly. I've cleaned up your latest addition, but please clean up the rest or I will have to temporarily close the post until you do. Thank you.

written 2.4 years ago by RamRS 30k

Check out this detailed article on principal component analysis by Markus Ringnér:

https://www.nature.com/articles/nbt0308-303

written 2.3 years ago by Arup Ghosh 2.7k

Hello Za!

I've closed the post until the image links are corrected.

Please see C: Interpretation of PCA and How to add images to a Biostars post

Once the images are corrected, I will reopen the post.

Thank you

Edit: A mod has corrected the post for you, I'll reopen it now.

modified 2.4 years ago • written 2.4 years ago by RamRS 30k

Please do not delete posts, especially when they have received comments/answers.

written 2.4 years ago by genomax 92k

Sorry, I was not able to figure out what this PCA plot ultimately says, which is why I thought of removing this post. I apologise, and thank you for fixing my pictures.

written 2.4 years ago by Za 130