Question: How many PCs should be considered for downstream analyses?
5 weeks ago, bioinforesearchquestions (United States) wrote:

Hi All,

I have two groups, WT and KO.

As per the Jackstraw plot, ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line).

  1. How should I interpret the JackStraw plot? How come even the PCs with p-value = 1 are above the dashed line?

  2. PC 5 has a p-value of 1. Should I consider only the PCs with p-value < 0.05 (PCs 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15) for the downstream analyses?

As per the elbow plot, it looks like the standard deviation bottoms out at around PC 34 and stays constant after that.

So how many PCs should I use for the downstream analyses (FindNeighbors, FindClusters and RunUMAP)?

cond_integrated <- FindNeighbors(object = cond_integrated, dims = ?)
cond_integrated <- FindClusters(object = cond_integrated)
cond_integrated <- RunUMAP(cond_integrated, reduction = "pca", dims = ?)

Each time I change the number of dimensions, I get a different UMAP clustering.

merged_cond <- merge(x = WT_seurat_obj, y = KO_seurat_obj, add.cell.ids = c("WT", "KO"))

# merged_cond was then filtered on mitochondrial content etc., giving:
filtered_cond_seurat

# split seurat object by condition from filtered_cond_seurat

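for (i in seq_along(split_cond)) {
  # Only the function names were given in the post; the arguments below are assumptions.
  split_cond[[i]] <- NormalizeData(split_cond[[i]])
  # s_genes and g2m_genes stand for S-phase and G2/M-phase gene vectors (not shown in the post)
  split_cond[[i]] <- CellCycleScoring(split_cond[[i]], s.features = s_genes, g2m.features = g2m_genes)
  split_cond[[i]] <- SCTransform(split_cond[[i]])
}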
Obtained integ_features from SelectIntegrationFeatures using the split_cond list of Seurat objects
Prepared split_cond for integration using PrepSCTIntegration with integ_features as the anchor features
Obtained integ_anchors using FindIntegrationAnchors with the SCT normalization method
Obtained the cond_integrated Seurat object using IntegrateData
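
The four integration steps listed above might look roughly like the sketch below. It assumes the Seurat v3/v4 SCT integration functions; the wrapper name `integrate_sct` and the value `nfeatures = 3000` are illustrative assumptions and were not given in the post.

```r
# Hypothetical wrapper around the SCT integration steps described in the post.
# Assumes the Seurat package (v3/v4 integration API) is installed; defining the
# function does not require Seurat, calling it does.
integrate_sct <- function(split_cond, nfeatures = 3000) {
  # Genes that are repeatedly variable across the per-condition objects
  integ_features <- Seurat::SelectIntegrationFeatures(
    object.list = split_cond, nfeatures = nfeatures
  )
  # Make sure SCT residuals for those genes are present in every object
  split_cond <- Seurat::PrepSCTIntegration(
    object.list = split_cond, anchor.features = integ_features
  )
  # Anchors between conditions, using SCT-normalized data
  integ_anchors <- Seurat::FindIntegrationAnchors(
    object.list = split_cond,
    normalization.method = "SCT",
    anchor.features = integ_features
  )
  # Returns the integrated Seurat object
  Seurat::IntegrateData(anchorset = integ_anchors, normalization.method = "SCT")
}

# Usage (not run here): cond_integrated <- integrate_sct(split_cond)
```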

cond_integrated <- RunPCA(object = cond_integrated)

DimHeatmap(cond_integrated, dims = 1:15, cells = 500, balanced = TRUE)

cond_integrated <- JackStraw(cond_integrated, num.replicate = 100, dims=50)

cond_integrated <- ScoreJackStraw(cond_integrated, dims = 1:50)

JackStrawPlot(cond_integrated, dims = 1:50)

ElbowPlot(object = cond_integrated, ndims = 50)
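
As a complement to reading the elbow plot by eye, here is a minimal sketch of one heuristic for locating the elbow from the per-PC standard deviations. The function name `choose_elbow` and all three thresholds are assumptions chosen for illustration, not Seurat defaults, and the result is only a rough starting point.

```r
# Heuristic elbow finder working on a plain numeric vector of PC standard deviations.
# Thresholds are illustrative assumptions.
choose_elbow <- function(stdev, cum_threshold = 90, pct_threshold = 5,
                         drop_threshold = 0.1) {
  pct <- stdev / sum(stdev) * 100   # each PC's share of the summed stdev
  cumu <- cumsum(pct)               # cumulative share
  # First PC where the cumulative share exceeds cum_threshold
  # and the PC's own share is below pct_threshold
  co1 <- which(cumu > cum_threshold & pct < pct_threshold)[1]
  # PC just after the last PC-to-PC drop that is larger than drop_threshold
  big_drops <- which(-diff(pct) > drop_threshold)
  co2 <- if (length(big_drops) > 0) max(big_drops) + 1 else NA
  candidates <- c(co1, co2)
  if (all(is.na(candidates))) return(length(stdev))  # no elbow found: keep all PCs
  min(candidates, na.rm = TRUE)
}

# Toy input: three strong PCs followed by a flat tail
toy_stdev <- c(10, 8, 6, 1, 1, 1, 1, 1, 1, 1)
choose_elbow(toy_stdev)  # 4
```

With a Seurat object, the input vector could be taken from `Stdev(cond_integrated, reduction = "pca")`.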

[Figures: PC heatmaps, JackStraw plot, elbow plot]

5 weeks ago, jared.andrews07 (St. Louis, MO) wrote:

When using SCTransform, the exact number of PCs matters somewhat less, as it tends to be more robust and to handle noise better. As such, you can provide a lot of PCs without introducing undue variation. I generally start with 30, but have gone up to 50 and noticed little difference. The authors generally recommend using more PCs than in the standard workflow, for reasons outlined in the SCTransform vignette:

Why can we choose more PCs when using sctransform?

In the standard Seurat workflow we focus on 10 PCs for this dataset, though we highlight that the results are similar with higher settings for this parameter. Interestingly, we’ve found that when using sctransform, we often benefit by pushing this parameter even higher. We believe this is because the sctransform workflow performs more effective normalization, strongly removing technical effects from the data.

Even after standard log-normalization, variation in sequencing depth is still a confounding factor (see Figure 1), and this effect can subtly influence higher PCs. In sctransform, this effect is substantially mitigated (see Figure 3). This means that higher PCs are more likely to represent subtle, but biologically relevant, sources of heterogeneity – so including them may improve downstream analysis.

In addition, sctransform returns 3,000 variable features by default, instead of 2,000. The rationale is similar, the additional variable features are less likely to be driven by technical differences across cells, and instead may represent more subtle biological fluctuations. In general, we find that results produced with sctransform are less dependent on these parameters (indeed, we achieve nearly identical results when using all genes in the transcriptome, though this does reduce computational efficiency).

In short, use more than you think you need and try not to overthink it.
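
As a rough sketch of that advice, the downstream steps from the question could be wrapped so that the same number of PCs is passed to both the neighbor graph and the UMAP. The helper name `run_downstream` is hypothetical, `n_pcs = 30` follows the starting point mentioned above, and `resolution = 0.8` is simply an assumed value.

```r
# Hypothetical helper: same PCs for FindNeighbors and RunUMAP.
# Assumes the Seurat package is installed and RunPCA has already been run on obj.
run_downstream <- function(obj, n_pcs = 30, resolution = 0.8) {
  dims <- seq_len(n_pcs)
  obj <- Seurat::FindNeighbors(obj, reduction = "pca", dims = dims)
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  Seurat::RunUMAP(obj, reduction = "pca", dims = dims)
}

# Usage (not run here): cond_integrated <- run_downstream(cond_integrated, n_pcs = 30)
```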


Could you please also look at another post of mine related to this analysis: "Downsampling one of the sample on the UMAP clustering to match the number of cells of the other group"?

Thanks, Jared, for your comments and resources.

I was playing around by changing the number of PCs from 15 to 40. I have attached them below for reference. The overall clustering pattern didn't change drastically. However, there is some change in the orientation of the clustering in each of the UMAP.

[Figures: UMAPs for dims 15, 20 and 25; UMAPs for dims 30, 35 and 40]

Written 5 weeks ago by bioinforesearchquestions

However, there is some change in the orientation of the clustering in each of the UMAP.

The orientation change is meaningless as the embedding is stochastic. The overall distribution of clusters is what matters, and that seems to be very similar in all cases.

Written 5 weeks ago by Mensur Dlakic
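
Since the UMAP layout itself is not comparable between runs, one way to check whether the cluster assignments agree across different numbers of PCs is to compare the label vectors directly. The sketch below computes the adjusted Rand index in base R; the function name and the toy label vectors are assumptions for illustration (with Seurat, the labels could come from, for example, `cond_integrated$seurat_clusters` after each run).

```r
# Adjusted Rand index between two clusterings of the same cells.
# 1 means identical partitions (label names do not matter); values near 0 mean
# agreement no better than chance.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  tab <- table(labels_a, labels_b)          # contingency table of the two labelings
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))          # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2))  # pairs grouped together in labels_a
  sum_cols <- sum(choose(colSums(tab), 2))  # pairs grouped together in labels_b
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  if (max_index == expected) return(1)      # degenerate case, e.g. one cluster in both
  (sum_cells - expected) / (max_index - expected)
}

# Toy example: same partition, different cluster names
clusters_dims15 <- c(0, 0, 1, 1, 2, 2)
clusters_dims30 <- c("b", "b", "a", "a", "c", "c")
adjusted_rand_index(clusters_dims15, clusters_dims30)  # 1
```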

Thanks, Mensur, for adding that point. Could you please also look at another post of mine related to this analysis: "Downsampling one of the sample on the UMAP clustering to match the number of cells of the other group"?

Written 5 weeks ago by bioinforesearchquestions