Question

The interpretation of PCA

0

6.2 years ago

Za ▴ 140

Hi,

I have bulk RNA-seq and single-cell RNA-seq data from the same organism at 9 time points (2h, 4h, ..., 16h), each in two replicates. I plotted a PCA of the two datasets (R and T). Based on this picture, my interpretation is this: 42% of the variance is due to time, because I likely have 9 groups of samples (the time points) over time, and 34% of the variance is due to the difference between bulk and single-cell RNA-seq data, because I likely have two major groups of data on PC2. Please correct me if I am wrong. So, taking bulk RNA-seq as a gold standard for quality control of the single-cell RNA-seq, we would conclude that the single-cell RNA-seq data is not good, because the two groups are not on top of each other and are too separated.

DESeq2 R PCA • 3.8k views

ADD COMMENT • link updated 6.1 years ago by Biostar 20 • written 6.2 years ago by Za ▴ 140

2

That the two groups are not on top of each other is not a sign of good or bad quality, but a sign of batch effect, which is totally expected with your data.

ADD REPLY • link 6.2 years ago by Benn 8.3k

1

To follow on from the obvious batch difference between the scRNA-seq and the bulk data, I also do not believe you can say that 42% of the variance is due to time, because it is not only PC1 that separates your samples by time. Indeed, such a statement would be misleading. Time is clearly distributed across both PC1 and PC2 in your PCA bi-plot.

I would not read too much into the PCA in terms of your project's conclusions. Just use it to get a 'feel' for your data distribution, mainly in terms of batch effects and outliers. Only 'get your teeth into' PCA (i.e. probe further) if you really, really understand the method and what it means in relation to the biological extrapolation from your numerical data.
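
To make the point about time being spread across PCs concrete, here is a minimal Python sketch on invented toy numbers (two made-up "genes", four time points, two batches; none of this is the poster's data). It runs a PCA on exactly two features using the closed-form eigendecomposition of the 2x2 covariance matrix, then checks how strongly each PC tracks time. The names `pca_two_features`, `gene1` and `gene2` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pca_two_features(f1, f2):
    """PCA for exactly two features (hypothetical helper).

    Returns the fraction of variance explained by PC1 and PC2 and the
    per-sample scores on each PC. Assumes the two features covary
    (non-zero covariance), so the eigenvector formula below is valid.
    """
    n = len(f1)
    m1, m2 = sum(f1) / n, sum(f2) / n
    c1 = [v - m1 for v in f1]  # centred feature 1
    c2 = [v - m2 for v in f2]  # centred feature 2
    a = sum(v * v for v in c1) / (n - 1)               # var(f1)
    d = sum(v * v for v in c2) / (n - 1)               # var(f2)
    c = sum(u * v for u, v in zip(c1, c2)) / (n - 1)   # cov(f1, f2)
    half_trace = (a + d) / 2
    root = math.sqrt(((a - d) / 2) ** 2 + c * c)
    eigenvalues = [half_trace + root, half_trace - root]
    scores = []
    for lam in eigenvalues:
        vx, vy = c, lam - a  # eigenvector of [[a, c], [c, d]] for lam
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
        scores.append([vx * u + vy * v for u, v in zip(c1, c2)])
    explained = [lam / (a + d) for lam in eigenvalues]
    return explained, scores

# Invented design: 4 time points measured in 2 batches (-1 = bulk, +1 = single cell).
time = [2, 4, 6, 8, 2, 4, 6, 8]
batch = [-1, -1, -1, -1, 1, 1, 1, 1]
gene1 = [t + 4 * b for t, b in zip(time, batch)]  # responds to time AND batch
gene2 = [float(t) for t in time]                  # responds to time only

explained, scores = pca_two_features(gene1, gene2)
for i in range(2):
    print(f"PC{i + 1}: {explained[i]:.1%} of variance, "
          f"|r| with time = {abs(pearson(scores[i], time)):.2f}, "
          f"|r| with batch = {abs(pearson(scores[i], batch)):.2f}")
```

In this toy case PC1 carries roughly 86% of the variance, yet its correlation with time is only about 0.58, while PC2 correlates with time at about 0.81. So the percentage printed on an axis describes the PC, not any single experimental factor.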

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

I am going to trace cell fate decisions using single-cell RNA-seq, so I want to know whether my data is good, which is why I compared it with bulk RNA-seq. How, then, can I show that my single-cell data is of sufficient quality to show cell fate decisions?

This is the distribution of read counts in the bulk and single-cell data.

ADD REPLY • link 6.2 years ago by Za ▴ 140

0

See: How to add images to a Biostars post

ADD REPLY • link 6.2 years ago by Ram 44k

0

Sorry, when I calculated the Pearson correlation between the bulk and single-cell data for each time point, there was a good correlation between them. So can I claim that the single-cell data is good?

ADD REPLY • link 6.2 years ago by Za ▴ 140

0

How did you conduct the correlation, exactly? - on a gene- or sample-wise basis? Did you derive p-values from the Pearson coefficient?
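
The gene-wise versus sample-wise distinction matters because the two can disagree completely. A small sketch in Python with invented counts (4 genes by 3 time points; the matrices `bulk` and `pooled` are made up for illustration):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def column(matrix, j):
    """Extract column j (one sample) from a genes-by-samples matrix."""
    return [row[j] for row in matrix]

# Rows = genes, columns = time points. Invented numbers:
# every gene rises over time in `bulk` but falls over time in `pooled`.
bulk = [[1000, 1100, 1200], [100, 120, 140], [10, 12, 14], [1, 2, 3]]
pooled = [[900, 800, 700], [90, 80, 70], [9, 8, 7], [3, 2, 1]]

# Sample-wise: one value per time point, computed across genes.
sample_wise = [pearson(column(bulk, j), column(pooled, j)) for j in range(3)]

# Gene-wise: one value per gene, computed across time points.
gene_wise = [pearson(b, p) for b, p in zip(bulk, pooled)]

print("sample-wise:", [round(r, 3) for r in sample_wise])
print("gene-wise:  ", [round(r, 3) for r in gene_wise])
```

Here the sample-wise correlations are close to 1, simply because abundant genes are abundant in both datasets, while the gene-wise correlations are strongly negative because the temporal trends are opposite. A high sample-wise correlation on its own therefore says little about whether the time-course patterns agree.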

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Thank you. This is a picture of the correlation across 11,000 genes between the 0-hour time point in bulk and in single cell, with a significant p-value.

I just wonder: can I use the correlation to say my data is good, or does the PCA tell the whole story?

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

The correlation looks good but I cannot see exactly how you've done it.

If you want to merge these data together, then you will obviously have to include ExperimentType (scRNA | bulkRNA) as a covariate in all statistical tests that you do.

Your PCA bi-plot indicates that there is a large batch effect between 2 sets of your samples (presumably the 2 'strata' in your bi-plot relate to the scRNA-seq and bulk RNA-seq samples).
The correlation values indicate that, despite batch differences, the 'patterns' and consistency (in terms of what goes up and down) of the expression values in these datasets are good.
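
As a rough illustration of what including the experiment type as a covariate does, here is a plain-Python least-squares sketch on invented, noise-free numbers. It is not DESeq2; there, the equivalent would be a design formula along the lines of `~ ExperimentType + time`. The helpers `solve` and `ols` and all values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def rss(X, y, beta):
    """Residual sum of squares of a fitted linear model."""
    return sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

# Invented data: expression = 5 + 0.5 * time - 3 * is_single_cell
time = [2, 4, 6, 8, 2, 4, 6, 8]
is_sc = [0, 0, 0, 0, 1, 1, 1, 1]  # ExperimentType: 0 = bulk, 1 = scRNA
expr = [5 + 0.5 * t - 3 * s for t, s in zip(time, is_sc)]

X_full = [[1.0, t, s] for t, s in zip(time, is_sc)]  # intercept, time, type
X_reduced = [[1.0, t] for t in time]                 # intercept, time only

beta_full = ols(X_full, expr)
beta_reduced = ols(X_reduced, expr)
rss_full = rss(X_full, expr, beta_full)
rss_reduced = rss(X_reduced, expr, beta_reduced)

print("with covariate:   ", [round(b, 3) for b in beta_full], "RSS =", round(rss_full, 6))
print("without covariate:", [round(b, 3) for b in beta_reduced], "RSS =", round(rss_reduced, 6))
```

With the covariate in the model, the offset between experiment types is absorbed by its own coefficient. Without it, that offset ends up in the residuals and would inflate the error estimate in any downstream test.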

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Thanks a lot. I don't want to merge the two datasets; rather, I just want to verify that my single-cell RNA-seq data is of sufficient quality, which is why I used bulk RNA-seq as a gold standard for this comparison.

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

It looks fine, on face value, i.e., the scRNA-seq data. Obviously I'm limited by what I can see here.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Thank you. Based on my understanding, PCA separates distinct groups from each other. Here, the 9 single-cell RNA-seq samples (I pooled the cells at each time point to simulate bulk RNA-seq) and the 9 bulk RNA-seq samples are clearly separated, which tells me they are different, although correlated. Samples ending in _T are cells pooled into a simulated bulk, and samples ending in _R are real bulk RNA-seq.

After normalisation the samples are closer to each other, but something still separates the real bulk RNA-seq from the simulated bulk RNA-seq obtained by pooling the cells. I thought that might be because of read depth, but when I looked at the read distribution of the differentially expressed genes (say, the 2-hour time point in real bulk RNA-seq vs the 2-hour time point in simulated bulk), these genes are not among the genes with very low read counts, so read depth alone is not the reason behind this sort of strata in the PCA.

I thought that if the single-cell data were a good representation of bulk RNA-seq, the pooled cells should not be separated from the corresponding time point in the real bulk RNA-seq, yet they are. I am not sure how to prove that the single-cell data is good enough, or how to explain this separation in the PCA.
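
For readers unfamiliar with the pooling step described here, a minimal Python sketch of one way to build such a simulated bulk sample follows. It assumes that pooling means summing each gene's counts over all cells at a time point; the post does not say whether sums or means were used, and the function name `pool_cells` and the counts are invented.

```python
def pool_cells(cells_by_time):
    """Sum per-gene counts over all cells at each time point.

    cells_by_time maps a time-point label to a list of cells, where each
    cell is a list of counts in the same gene order.
    """
    pooled = {}
    for time_point, cells in cells_by_time.items():
        # zip(*cells) groups the counts of each gene across cells
        pooled[time_point] = [sum(gene_counts) for gene_counts in zip(*cells)]
    return pooled

# Invented counts: 3 genes, three cells at 2h and two cells at 4h.
cells_by_time = {
    "2h": [[0, 3, 1], [2, 0, 1], [1, 1, 0]],
    "4h": [[5, 0, 0], [4, 1, 2]],
}

pseudo_bulk = pool_cells(cells_by_time)
print(pseudo_bulk)
```

Note that the pooled totals depend on how many cells were captured at each time point, so such samples still need library-size normalisation before being compared with real bulk samples.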

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

PCA is fundamentally based on covariance, which will be indirectly influenced by differences in read-depths.
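
A small Python sketch of how read depth can dominate the geometry that PCA works on. PCA rotates the centred data, so with all components kept it preserves the Euclidean distances between samples; the sketch therefore just compares those distances before and after a simple counts-per-million scaling. The sample names and counts are invented, and CPM is used here only as a simple stand-in for a proper normalisation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cpm(counts):
    """Scale a sample to counts per million of its own library size."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]

# Invented samples: A and B differ in composition; "deep" and "shallow"
# differ only in sequencing depth (10x fewer reads).
samples = {
    "A_deep": [500000, 300000, 200000],
    "A_shallow": [50000, 30000, 20000],
    "B_deep": [400000, 300000, 300000],
}

# On raw counts, the depth difference is larger than the biological one.
raw_depth_gap = euclidean(samples["A_deep"], samples["A_shallow"])
raw_biology_gap = euclidean(samples["A_deep"], samples["B_deep"])

# After scaling by library size, only the compositional difference remains.
scaled = {name: cpm(counts) for name, counts in samples.items()}
cpm_depth_gap = euclidean(scaled["A_deep"], scaled["A_shallow"])
cpm_biology_gap = euclidean(scaled["A_deep"], scaled["B_deep"])

print(f"raw: depth gap = {raw_depth_gap:.0f}, biology gap = {raw_biology_gap:.0f}")
print(f"CPM: depth gap = {cpm_depth_gap:.0f}, biology gap = {cpm_biology_gap:.0f}")
```

On the raw counts, two samples with identical composition but different depth sit further apart than two samples with different composition, so a PCA on such data would mostly show depth.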

Your best bet is to analyse your bulk and single-cell data separately, and to treat them as entirely independent studies. Different information can be extracted from single cell data that cannot be readily extracted from bulk.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Excuse me, I just noticed that when I simply normalise my matrices with DESeq2, the correlation between bulk RNA-seq and simulated bulk RNA-seq (from pooling the cells at each time point) is about 40% for each time point, but when I take log2 of the data after normalisation, the correlation goes up to 80%. Do you think I should say my correlation is 80% or 40%? If it is 40%, that is very frustrating, but after log transformation it is promising. I am not sure whether it is acceptable to report the correlation after log transformation. Am I cheating in reporting that correlation?

ADD REPLY • link 6.2 years ago by Za ▴ 140

0

How about if you do the regularised log transformation, instead of just log2?
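
To see why the Pearson correlation can jump after a log transformation, here is a small Python sketch on invented counts. It uses a plain `log2(x + 1)` as a stand-in, since the regularised log is a DESeq2 function (`rlog`) that is not reproduced here, and it adds a rank-based (Spearman) correlation, computed as the Pearson correlation of the ranks, for comparison. All numbers are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented normalised counts for 8 genes. The last gene is very highly
# expressed in bulk but only moderately expressed in the pooled cells.
bulk = [10, 20, 40, 80, 160, 320, 640, 20000]
pooled = [12, 18, 50, 70, 150, 400, 600, 300]

log_bulk = [math.log2(v + 1) for v in bulk]
log_pooled = [math.log2(v + 1) for v in pooled]

r_raw = pearson(bulk, pooled)
r_log = pearson(log_bulk, log_pooled)
rho_raw = spearman(bulk, pooled)
rho_log = spearman(log_bulk, log_pooled)

print(f"Pearson  raw = {r_raw:.2f}, log2 = {r_log:.2f}")
print(f"Spearman raw = {rho_raw:.2f}, log2 = {rho_log:.2f}")
```

In this toy example the raw Pearson value is roughly 0.2 and the log2 value roughly 0.8, because on the raw scale a single highly expressed gene dominates the calculation. The Spearman value is identical before and after the transformation, since a monotone transformation does not change the ranks. Whichever figure is reported, stating the scale it was computed on avoids misleading the reader.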

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Thank you, I have not tried that yet. But in general, do you think reporting the correlation after log2 is cheating?

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

'Cheating' is a strong word. What would be incorrect would be to 'mislead' your audience by making an incorrect inference from the correlation values that you're getting. For example, I would simply not give much importance to these correlation values and would therefore not report them.

As I mentioned, I would treat the bulk- and single cell RNA-seq data as entirely independent experiments and not try to 'merge' them together. They are fundamentally different.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Sorry, do you think correcting the batch effect between these two different sample preparation methods (single-cell and bulk RNA-seq) could make them more similar?

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

Bulk and single-cell RNA-seq are very different and cannot be analysed together. There are no valid batch correction methods that can fix that.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Sorry, within my single-cell RNA-seq data I have one time point sequenced on the Fluidigm C1 and 8 time points on the iCELL8. I noticed a big difference in read counts between time points, as the Fluidigm C1 gave me more read counts than the iCELL8. I think this is where batch effect correction comes in handy. But do you know why the Fluidigm C1 gives more read counts than the iCELL8? I have been googling, but I found nothing clear.

ADD REPLY • link 6.2 years ago by Za ▴ 140

1

Well, they are different technologies, so they are undoubtedly measuring expression on different scales. You should definitely assume that there will be a batch effect between these data types, and correct for it appropriately if you are planning to merge them together.

May I ask how on Earth an experimental design was devised that includes data from 2 different platforms?

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Za, this is the third time you're adding an image improperly. I've cleaned up your latest addition, but please clean up the rest or I will have to temporarily close the post until you do. Thank you.

ADD REPLY • link 6.2 years ago by Ram 44k

1

Check out this detailed article on principal component analysis by Markus Ringnér:

https://www.nature.com/articles/nbt0308-303

ADD REPLY • link 6.1 years ago by Arup Ghosh 3.2k

0

Hello Za!

I've closed the post until the image links are corrected.

Please see C: Interpretation of PCA and How to add images to a Biostars post

Once the images are corrected, I will reopen the post.

Thank you

Edit: A mod has corrected the post for you, I'll reopen it now.

ADD REPLY • link 6.2 years ago by Ram 44k

0

Please do not delete posts, especially when they have received comments/answers.

ADD REPLY • link 6.2 years ago by GenoMax 145k

0

Sorry, but I was not able to figure out what this PCA plot finally says, which is why I thought of removing this post. I apologise, and thank you for fixing my pictures.

ADD REPLY • link 6.2 years ago by Za ▴ 140