I have bulk RNA-seq and single cell RNA-seq data on the same organism at 9 time points (2h, 4h, ..., 16h) in two replicates. I plotted a PCA of the two datasets (R and T). Based on this picture, my interpretation is this: 42% of the variance is due to time, because I likely have 9 groups of samples (time points) over time, and 34% of the variance is due to the difference between bulk and single cell RNA-seq data, because I likely have two major groups of data on PC2. Please correct me if I am wrong. So, taking bulk RNA-seq as a gold standard for quality control of single cell RNA-seq, we would conclude that the single cell RNA-seq data is not good, because the two groups are not on top of each other and are too separated.
That the two groups are not on top of each other is not a sign of good or bad quality, but a sign of batch effect, which is totally expected with your data.
Following on from the obvious batch difference between the scRNA-seq and the bulk data, I also do not believe you can say that 42% of the variance is due to time, because it is not only PC1 that is separating your samples based on time. Indeed, such a statement would be misleading. Time is clearly distributed across both PC1 and PC2 in your PCA bi-plot.
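One way to check this, rather than reading it off the plot, is to correlate each PC's sample scores with the known covariates. Below is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the data are simulated, the 0h to 16h time grid, the effect sizes, and the helper names `pca` and `r_squared` are made up, and it is not the analysis that produced the plot in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 9 time points on each of two platforms (18 samples).
time = np.tile(np.arange(0, 18, 2), 2).astype(float)   # 0h, 2h, ..., 16h (assumed)
platform = np.repeat([0.0, 1.0], 9)                     # 0 = bulk, 1 = pooled single cell

# Simulated expression: a time effect, a platform effect, and noise.
n_genes = 500
time_loading = rng.normal(size=n_genes)
platform_loading = rng.normal(size=n_genes)
expr = (np.outer(time / time.std(), time_loading)
        + 2.0 * np.outer(platform, platform_loading)
        + rng.normal(scale=0.5, size=(time.size, n_genes)))

def pca(x):
    """PCA via SVD of the column-centred matrix (rows = samples)."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                       # sample coordinates on each PC
    explained = s**2 / np.sum(s**2)      # fraction of variance per PC
    return scores, explained

def r_squared(a, b):
    """Squared Pearson correlation between two vectors."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

scores, explained = pca(expr)

for k in range(3):
    print(f"PC{k + 1}: {explained[k]:.1%} of variance, "
          f"R^2 with time = {r_squared(scores[:, k], time):.2f}, "
          f"R^2 with platform = {r_squared(scores[:, k], platform):.2f}")
```

If time shows a sizeable R^2 on more than one PC, then the variance explained by any single PC cannot be attributed to time alone.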
I would not look too much into PCA in terms of your project's conclusions. Just use it to get a 'feel' of your data distribution, mainly in terms of batch effects and outliers. Only 'get your teeth into' (i.e. probe further) PCA if you really really understand the method and what it means in relation to the biological extrapolation from your numerical data.
I am going to trace cell fate decisions with single cell RNA-seq, so I want to know whether my data is good; that is why I compared it with bulk RNA-seq. How can I show that my single cell data is of sufficient quality to show cell fate decisions?
This is the distribution of read counts in bulk and single cell.
See: How to add images to a Biostars post
Sorry, when I calculated the Pearson correlation for each time point between bulk and single cell RNA-seq, there was a good correlation between them. Can I then claim that the single cell data is good?
How did you conduct the correlation, exactly? - on a gene- or sample-wise basis? Did you derive p-values from the Pearson coefficient?
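To make the distinction concrete, here is a minimal sketch of the two options in Python with NumPy. The matrices are simulated (genes in rows, time points in columns), and the names `bulk`, `pooled_sc` and `pearson_rows` are hypothetical, not taken from the poster's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

n_genes, n_times = 1000, 9

# Simulated log-scale expression shared by both platforms:
# a per-gene baseline plus a per-gene time course.
signal = (rng.normal(loc=5.0, scale=2.0, size=(n_genes, 1))
          + rng.normal(scale=1.0, size=(n_genes, n_times)))

# Each platform sees the shared signal plus its own measurement noise.
bulk = signal + rng.normal(scale=0.5, size=signal.shape)
pooled_sc = signal + rng.normal(scale=0.5, size=signal.shape)

def pearson_rows(a, b):
    """Pearson r between matching rows of a and b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a**2).sum(axis=1) * (b**2).sum(axis=1))

# Sample-wise: one r per time point, computed across all genes.
sample_wise = pearson_rows(bulk.T, pooled_sc.T)   # shape (9,)

# Gene-wise: one r per gene, computed across the 9 time points.
gene_wise = pearson_rows(bulk, pooled_sc)         # shape (1000,)

print("sample-wise r per time point:", np.round(sample_wise, 3))
print("median gene-wise r:", round(float(np.median(gene_wise)), 3))
```

Sample-wise correlations across thousands of genes tend to be high largely because genes differ so much in their baseline expression, and with that many points the p-value will be tiny almost regardless. If p-values are wanted, `scipy.stats.pearsonr` returns both the coefficient and the p-value.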
Thank you. This is a picture of the correlation across 11,000 genes between bulk and single cell at 0 hours, with a significant p-value.
I just wonder whether I can use the correlation to say my data is good, or whether PCA will say everything?
The correlation looks good but I cannot see exactly how you've done it.
If you want to merge these data together, then you will obviously have to include ExperimentType (scRNA | bulkRNA) as a covariate in all statistical tests that you do.
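As a rough illustration of what including the covariate means, here is a minimal per-gene linear model sketch in Python with NumPy. It is an assumption-laden toy: the data are simulated, the coefficients are invented, and a real analysis would use a dedicated package (for example a design formula in DESeq2) rather than ordinary least squares on one gene.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design: 9 time points measured on each of two platforms.
time = np.tile(np.arange(0, 18, 2), 2).astype(float)
experiment_type = np.repeat([0.0, 1.0], 9)   # 0 = bulkRNA, 1 = scRNA

# Simulated log-expression for a single gene (true values are made up).
true_intercept, true_time, true_platform = 5.0, 0.3, 2.0
y = (true_intercept + true_time * time + true_platform * experiment_type
     + rng.normal(scale=0.1, size=time.size))

def fit(design, response):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    rss = float(np.sum((response - design @ coef) ** 2))
    return coef, rss

ones = np.ones_like(time)
full = np.column_stack([ones, time, experiment_type])   # ~ time + ExperimentType
reduced = np.column_stack([ones, time])                  # ~ time only

coef_full, rss_full = fit(full, y)
coef_reduced, rss_reduced = fit(reduced, y)

print("full model    (intercept, time, ExperimentType):", np.round(coef_full, 3))
print("reduced model (intercept, time):                ", np.round(coef_reduced, 3))
print("residual sum of squares, full vs reduced:", round(rss_full, 3), round(rss_reduced, 3))
```

Without the covariate, the platform shift ends up in the residuals, which inflates the error term that any test on the time effect is judged against.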
Thanks a lot. I don't want to merge the two datasets; rather, I just want to verify that my single cell RNA-seq data is of sufficient quality, which is why I have used bulk RNA-seq as a gold standard for this comparison.
It looks fine, on face value, i.e., the scRNA-seq data. Obviously I'm limited by what I can see here.
Thank you. Based on my understanding, PCA separates distinct groups from each other. Here, the 9 single cell RNA-seq samples (I pooled the cells at each time point to simulate bulk RNA-seq) and the 9 bulk RNA-seq samples are clearly separated, which tells me they are different although correlated. Samples ending in _T are cells pooled into a simulated bulk, and samples ending in _R are real bulk RNA-seq. After normalisation the samples are closer to each other, but something still keeps the real bulk RNA-seq and the simulated bulk RNA-seq (from pooling the cells) apart. I thought that might be because of read depth, but when I looked at the read distribution of the differentially expressed genes (let's say the 2 hour time point in real bulk RNA-seq vs the 2 hour time point in simulated bulk RNA-seq), these genes are not among those with very low read counts, so read depth alone is not the reason behind this stratification in the PCA. I thought that if the single cell data were a good representation of bulk RNA-seq, the pooled cells should not be separated from the corresponding time point in real bulk RNA-seq, yet they are. I am not sure how to prove that the single cell data is good enough, or how to explain this separation in the PCA.
PCA is fundamentally based on covariance, which will be indirectly influenced by differences in read-depths.
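A small simulation can show the kind of depth effect meant here. This is only a sketch in Python with NumPy under strong assumptions: the same underlying expression profile is sampled at two made-up library sizes, and the helper names `pc1_scores` and `r2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes = 300
# Hypothetical shared expression profile: every library samples the same proportions.
props = rng.dirichlet(np.ones(n_genes))

# Five deep and five shallow libraries (depths are invented for illustration).
depths = np.array([1_000_000] * 5 + [50_000] * 5)
counts = np.vstack([rng.multinomial(d, props) for d in depths]).astype(float)

def pc1_scores(x):
    """Sample scores on the first principal component (rows = samples)."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def r2(a, b):
    """Squared Pearson correlation between two vectors."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

is_deep = (depths == depths.max()).astype(float)

# PCA on raw counts: covariance is dominated by the difference in library size.
raw_r2 = r2(pc1_scores(counts), is_deep)

# PCA after library-size normalisation (counts per million) and log2.
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
norm_r2 = r2(pc1_scores(np.log2(cpm + 1)), is_deep)

print(f"R^2 between PC1 and depth group, raw counts:    {raw_r2:.3f}")
print(f"R^2 between PC1 and depth group, log2(CPM + 1): {norm_r2:.3f}")
```

On raw counts PC1 is essentially the library-size difference. Normalisation removes the scale difference, but shallow libraries remain noisier, especially for lowly expressed genes, so some depth-related structure can persist; how much depends on the data.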
Your best bet is to analyse your bulk and single-cell data separately, and to treat them as entirely independent studies. Different information can be extracted from single cell data that cannot be readily extracted from bulk.
Excuse me, I just noticed that when I simply normalise my matrices with DESeq2, the correlation at each time point between bulk RNA-seq and simulated bulk RNA-seq (from pooling the cells at each time point) is about 40%, but when I take log2 of the data after normalisation, the correlation goes up to 80%. Do you think I should say my correlation is 80% or 40%? If it is 40%, this is very frustrating, but after log transformation it is promising. I don't know what it means if I report the correlation after log transformation. Am I cheating in reporting that correlation?
How about if you do the regularised log transformation, instead of just log2?
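For context on why the transformation changes the number so much, here is a minimal sketch in Python with NumPy on simulated heavy-tailed data. The values are invented, and the regularised log transformation itself is part of DESeq2 in R and is not reproduced here; the sketch only compares Pearson on the raw scale, Pearson on log2(x + 1), and Spearman.

```python
import numpy as np

rng = np.random.default_rng(4)

n_genes = 11000
rho = 0.8   # correlation on the log scale (made up)

# Correlated standard-normal values for the two data sets.
z_bulk = rng.normal(size=n_genes)
z_sc = rho * z_bulk + np.sqrt(1 - rho**2) * rng.normal(size=n_genes)

# Heavy-tailed "normalised counts": a few genes are orders of magnitude above the rest.
bulk = np.exp(3.0 + 2.0 * z_bulk)
pooled_sc = np.exp(3.0 + 2.0 * z_sc)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def ranks(v):
    """Ranks 0..n-1 (ties are not expected in continuous simulated data)."""
    return np.argsort(np.argsort(v)).astype(float)

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

r_raw = pearson(bulk, pooled_sc)
r_log = pearson(np.log2(bulk + 1), np.log2(pooled_sc + 1))
s_raw = spearman(bulk, pooled_sc)
s_log = spearman(np.log2(bulk + 1), np.log2(pooled_sc + 1))

print(f"Pearson, raw scale:      {r_raw:.3f}")
print(f"Pearson, log2(x + 1):    {r_log:.3f}")
print(f"Spearman, raw scale:     {s_raw:.3f}")
print(f"Spearman, log2(x + 1):   {s_log:.3f}")
```

Pearson on the raw scale is driven mostly by the few very highly expressed genes, while the log compresses them so that all genes contribute more evenly. Spearman depends only on ranks, so a monotone transformation such as log2 leaves it unchanged. Whichever value is reported, the scale it was computed on should be stated.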
Thank you, I have not tried that yet, but in general, do you think reporting the correlation after log2 is cheating?
'Cheating' is a strong word. What would be incorrect would be to 'mislead' your audience by making an incorrect inference from the correlation values that you're getting. For example, I would simply not give much importance to these correlation values and would therefore not report them.
As I mentioned, I would treat the bulk- and single cell RNA-seq data as entirely independent experiments and not try to 'merge' them together. They are fundamentally different.
Sorry, do you think correcting for batch effect between these two different sample preparation methods (single cell and bulk RNA-seq) could push them to be similar?
Bulk and single cell RNA-seq are very different and cannot be analysed together. There are no valid batch-correction methods that can fix that.
Sorry, within my single cell RNA-seq data I have one time point sequenced on Fluidigm C1 and 8 time points on iCELL8. I noticed a big difference in read counts between time points, as Fluidigm C1 has given me more read counts than iCELL8. I think batch effect correction would come in handy here. But do you know why Fluidigm C1 gives more read counts than iCELL8? Whatever I google, I find nothing clear.
Well, they are different technologies, so they are undoubtedly measuring expression on different scales. You should definitely assume that there will be a batch effect between these data types, and correct for it appropriately if you are planning to merge them together.
May I ask how on Earth an experimental design was devised that includes data from 2 different platforms?
Za, this is the third time you're adding an image improperly. I've cleaned up your latest addition, but please clean up the rest or I will have to temporarily close the post until you do. Thank you.
Check out this detailed PC analysis article by Markus Ringnér,
I've closed the post until the image links are corrected.
Please see C: Interpretation of PCA and How to add images to a Biostars post
Once the images are corrected, I will reopen the post.
Edit: A mod has corrected the post for you, I'll reopen it now.
Please do not delete posts, especially when they have received comments/answers.
Sorry, but in the end I was not able to figure out what this PCA plot says, which is why I thought of removing this post. I apologise, and thank you for fixing my pictures.