PCA value for integrated analysis in scRNA
3.9 years ago
ek699 ▴ 10

Hi, I have data from a specific cell type from mice fed a certain diet. I have four datasets in total, measured at four different time points. My goal is to integrate the four datasets and run an integrated single-cell RNA-seq analysis. I have been using Seurat and following the vignette: https://satijalab.org/seurat/v3.1/immune_alignment.html

This is my first time doing an integrated analysis for scRNA-seq. I first ran a clustering analysis on one of the four datasets (around 3,000 cells after pre-processing) before integrating all four. For that one dataset, I included all 40 significant PCs to run UMAP and clustering.

After that, I tried integrating two of the four datasets (around 4,700 cells after integration) just to see how Seurat would work, and when I ran

JackStrawPlot(two.combined, dims = 1:50)

it showed that all 50 PCs (possibly even more than 50) are significant.

From this, I guess that if I integrate all four datasets, the number of significant PCs will very likely differ from the values observed so far (40 and 50).

In this case:

  1. Should I just ignore the previous PC values, integrate all four datasets, and use JackStrawPlot to determine how many PCs to include in the integrated analysis?
  2. Is there any relationship between the PC value and the CC value used in FindIntegrationAnchors? For example, should the dimension in FindIntegrationAnchors be greater than or equal to the PC value in RunUMAP? If not, is there a specific way to determine an ideal CC value, like the JackStraw plot or elbow plot that we use to find an ideal PC value?

Thank you!

scRNA Seurat integration PCA • 2.0k views

I would simply go with the default (probably 50 or so) and then check whether you get the clustering behaviour you are looking for. There is not (from what I've read) a foolproof strategy for choosing the "right" number of PCs. In the end, the whole integration procedure is only used for clustering and visualization, and if you get a reasonable cluster landscape that you can work with, then go with it. I would then proceed with downstream analysis and only go back and change the PC choice and clustering parameters if you end up with clusters that make no sense. Remember that all other analysis (differential analysis etc.) is based on the unintegrated values.

When you have a lot of cells (>10,000), the "significant PCs" calculations seem to be irrelevant since you always get too many.

I would use the same number of dimensions for UMAP as you use for integration. UMAP is used to visualize the integration results. If you are using different inputs, you are not really doing that.
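
A minimal sketch of keeping the dimensions consistent, assuming the Seurat v3 workflow from the vignette linked in the question. The helper name `integrate_and_embed`, the default `n_dims = 30`, and the clustering resolution of 0.5 are illustrative assumptions, not recommendations. Each object in the list is assumed to be already normalized, with variable features selected.

```r
# Hypothetical helper: a single n_dims value drives integration, PCA,
# UMAP and clustering, so the UMAP reflects the integration settings.
integrate_and_embed <- function(object_list, n_dims = 30) {
  stopifnot(is.list(object_list), length(object_list) >= 2)
  stopifnot(is.numeric(n_dims), length(n_dims) == 1, n_dims >= 2)
  dims <- seq_len(n_dims)

  # Integration: the same dims are passed to both steps.
  anchors <- Seurat::FindIntegrationAnchors(object.list = object_list, dims = dims)
  combined <- Seurat::IntegrateData(anchorset = anchors, dims = dims)

  # Downstream steps run on the integrated assay with the same dims.
  Seurat::DefaultAssay(combined) <- "integrated"
  combined <- Seurat::ScaleData(combined, verbose = FALSE)
  combined <- Seurat::RunPCA(combined, npcs = n_dims, verbose = FALSE)
  combined <- Seurat::RunUMAP(combined, reduction = "pca", dims = dims)
  combined <- Seurat::FindNeighbors(combined, reduction = "pca", dims = dims)
  combined <- Seurat::FindClusters(combined, resolution = 0.5)
  combined
}
```

With four pre-processed Seurat objects (hypothetically named `ds1` to `ds4`), this could be called as `integrate_and_embed(list(ds1, ds2, ds3, ds4), n_dims = 30)` and rerun with a different `n_dims` if the result does not look reasonable.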

Thank you for the prompt response! I just edited my post to include the number of cells. When I integrate all four datasets, the total is fewer than 10,000 cells, though. So I can simply integrate the four datasets and rely only on the JackStraw plot of the integrated object to choose the PC value, right? When you said "same number of dimensions", did you mean the CC value that I use for FindIntegrationAnchors? If so, how would you find an ideal number of dimensions when you integrate the datasets?

3,000 is still a lot. My main point was that I frequently see over 50 significant PCs, especially with larger datasets, but I don't think I have ever seen any publications use more than that.

> When you said "same number of dimensions", did you mean the CC value that I use for FindIntegrationAnchors?

Yes.

> If so, how would you find an ideal number of dimensions when you integrate the datasets?

You look at the output, including the UMAP.
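
One common way to look at the output is to plot the same UMAP twice, colored once by dataset of origin and once by cluster. The following is a minimal sketch, assuming an integrated Seurat object that already has a `umap` reduction and cluster identities, and assuming that the `orig.ident` metadata column distinguishes the datasets (this depends on how the objects were created). The helper name `plot_integration_check` is hypothetical.

```r
# Hypothetical helper: returns two UMAP plots of the same embedding,
# one colored by dataset of origin and one by current cluster identity.
plot_integration_check <- function(combined, sample_col = "orig.ident") {
  stopifnot(is.character(sample_col), length(sample_col) == 1)
  list(
    by_dataset = Seurat::DimPlot(combined, reduction = "umap", group.by = sample_col),
    by_cluster = Seurat::DimPlot(combined, reduction = "umap", label = TRUE)
  )
}
```

Whether a given pattern is a problem still depends on the experiment: cells that separate by dataset could reflect poor alignment or a real difference between time points, which is a question for whoever designed the study.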

Thank you so much. For example, if it turns out that 60 PCs are significant for the integrated object, what number of PCs should I start with? I don't have a biology background, and I am having a hard time deciding on good parameter values when looking at the outputs. Are there any tips that would help me say "this output looks good (or bad)" when I look at the outputs?

You need to understand the data to interpret it. There is no formula that will tell you the correct number. If there were, Seurat would not ask you to look at the significant components.

It's okay if you don't have a biology background. Someone with a biology background paid a lot of money to do the experiment. That person should not expect someone who has no idea what they are looking at to generate the results. You can ask them what the experiment is and how to interpret the data.

Your point is absolutely right. I will need to communicate frequently with the biologists to get the best results for them.
