Question

Cells from normal sample were incorrectly classified as tumor cells using copykat

26 days ago

tujuchuanli ▴ 140

Hi everyone,

I am trying to identify tumor cells in an scRNA-seq dataset from cancer patients using copykat (https://github.com/navinlabcode/copykat). Following the manual's recommendations, I ran the analysis sample by sample using the code below:

library(Seurat)
library(copykat)

# `samples`, `dataset`, and `seurat_obj.filter` are defined earlier in the pipeline.
for (sample in samples) {
  # Run copykat on one sample at a time.
  sample_obj <- subset(seurat_obj.filter, subset = orig.ident == sample)
  count_mtx <- sample_obj@assays$RNA@counts  # raw count matrix (genes x cells)
  copykat_result <- copykat(
    rawmat = count_mtx,
    id.type = "S",
    ngene.chr = 5,
    win.size = 25,
    KS.cut = 0.1,
    sam.name = sample,
    distance = "euclidean",
    norm.cell.names = "",
    output.seg = "FALSE",
    plot.genes = "TRUE",
    genome = "hg20",
    n.cores = 1
  )
  # One result file per sample.
  save(copykat_result, file = paste0("copykat_result.", dataset, "-", sample, ".Rdata"))
}

However, when I checked the results, I noticed that a substantial proportion of cells from normal samples (around 50% in one particular sample) were incorrectly classified as tumor cells. I suspect this is due to suboptimal parameter settings, but I'm not sure how to adjust them effectively.

I would greatly appreciate any suggestions or advice on how to optimize the parameters or improve the analysis.

Thanks in advance!

copykat scRNA-seq cell tumor • 642 views

updated 1 day ago by Kevin Blighe 89k • written 26 days ago by tujuchuanli ▴ 140

Provide a vector of representative normal cells from each normal sample in the norm.cell.names option.
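
A minimal sketch of building such a vector, assuming CD45 (gene symbol PTPRC) raw counts are used to flag immune cells. The helper `pick_normal_cells` and its `min_count` threshold are hypothetical and not part of copykat, and the toy matrix only stands in for a real per-sample count matrix:

```r
# Hypothetical helper: return the names of cells whose raw count for a
# marker gene reaches a threshold (genes in rows, cells in columns).
pick_normal_cells <- function(count_mtx, marker = "PTPRC", min_count = 1) {
  if (!marker %in% rownames(count_mtx)) {
    return(character(0))  # marker absent: no reference cells available
  }
  marker_counts <- count_mtx[marker, ]
  colnames(count_mtx)[marker_counts >= min_count]
}

# Toy matrix: 2 genes x 4 cells (assumed data, for illustration only).
toy_counts <- matrix(
  c(0, 3, 0, 1,
    5, 2, 7, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("PTPRC", "EPCAM"), paste0("cell", 1:4))
)

normal_cells <- pick_normal_cells(toy_counts)
print(normal_cells)
```

The resulting vector would then be passed as norm.cell.names = normal_cells in the copykat() call for that sample. A cluster-level annotation may be less sensitive to dropout than a per-cell count threshold.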

25 days ago by Arup Ghosh 3.5k

Since I'm running an automated pipeline to process multiple datasets, can I simply use the immune cells from CD45+ clusters in each dataset as the normal cells? I'll try this approach and share my feedback.

Thanks for the valuable suggestion.

24 days ago by tujuchuanli ▴ 140

score 1 · Answer 1 · 2025-11-08

I'm sorry to hear about the high rate of false positives in your normal sample. It is frustrating when CopyKAT misclassifies that many cells, especially when accuracy matters for downstream analyses such as tumor heterogeneity studies.

Based on the tool's documentation, these errors often stem from over-segmentation or noise amplification in diploid (normal) cells, leading to spurious aneuploidy calls. Here's a bit more detail on targeted optimizations, starting with your current setup (ngene.chr=5, win.size=25, KS.cut=0.1):

Key Parameter Adjustments

KS.cut (Kolmogorov-Smirnov threshold for breakpoints): This controls segmentation sensitivity (0–1 scale). Your 0.1 is a common starting point. To get fewer false positives in normals, try raising it to 0.15 to reduce breakpoints overall: higher values "decrease sensitivity, i.e., less segments/breakpoints," smoothing out noise in diploid profiles. Lowering it to 0.05 goes the other way (more sensitive to subtle changes, more breakpoints), so only try that in combination with other filters. Test incrementally on a subset (e.g., 1,000 cells) to avoid full re-runs.
ngene.chr (minimum genes per chromosome for inclusion): At 5, this is "not very stringent," which can retain noisy cells and inflate false positives. Bump it to 10 to enforce better data quality per chromosome, filtering out low-coverage cells that mimic aneuploidy. Avoid going below 5, as it risks even more noise.
win.size (genes per segmentation window): Your 25 provides good resolution, but increasing it to 50–100 can smooth minor variations in normal cells, reducing over-fragmentation while preserving tumor-specific large-scale CNAs. The manual suggests experimenting in the 15–150 range for balance.
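
As a rough illustration, one more conservative parameter set could be collected in a list and passed to copykat(). The values are starting points taken from the suggestions above, not validated settings, and `run_copykat` is a hypothetical wrapper rather than a copykat function:

```r
# Candidate settings: stricter cell filter, larger windows, fewer breakpoints.
stricter_args <- list(
  id.type = "S",
  ngene.chr = 10,
  win.size = 50,
  KS.cut = 0.15,
  distance = "euclidean",
  norm.cell.names = "",
  output.seg = "FALSE",
  plot.genes = "TRUE",
  genome = "hg20",
  n.cores = 1
)

# Hypothetical wrapper; it only works if the copykat package is installed.
run_copykat <- function(count_mtx, sample_name, args = stricter_args) {
  if (!requireNamespace("copykat", quietly = TRUE)) {
    stop("copykat is not installed")
  }
  do.call(
    copykat::copykat,
    c(list(rawmat = count_mtx, sam.name = sample_name), args)
  )
}
```

Keeping the arguments in a list makes it easy to change one value at a time and compare runs on the same subset of cells.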

Additional Tips to Refine Results

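Incorporate gene expression filters: LOW.DR and UP.DR set the minimum fraction of cells in which a gene must be detected (LOW.DR for smoothing, UP.DR for segmentation). As far as I can tell from the documentation, the defaults are LOW.DR=0.05 and UP.DR=0.1, and LOW.DR has to stay below UP.DR. Raising them excludes sparsely detected genes, which curbs low-expression noise that could drive false calls. If your dataset has batch effects across samples, ensure per-sample normalization holds.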
Distance metric and clustering: Stick with "euclidean" for stable, larger CN segments (as you're doing). After the run, you can take the cells predicted as aneuploid and cluster them hierarchically with the ward.D2 method, which can help spot and exclude outlier false positives.
Validation steps: Enable output.seg="TRUE" to generate segment files for IGV visualization; inspect normals for spurious gains/losses. Also, if you have known normal references, supply them via norm.cell.names to anchor predictions.
Quick test workflow: Subset to your problematic sample, iterate parameters in a loop (e.g., grid search on KS.cut and win.size), and visualize heatmaps with chromosome coloring to quantify improvements in diploid purity.
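
The quick test workflow could be organized roughly as follows. This sketch only builds the parameter grid and scores a prediction table. It assumes the `prediction` element returned by copykat() is a data frame with a `copykat.pred` column holding "aneuploid", "diploid", or "not.defined", and `aneuploid_fraction` is a hypothetical helper:

```r
# Grid of candidate settings for KS.cut and win.size.
param_grid <- expand.grid(
  KS.cut = c(0.05, 0.1, 0.15),
  win.size = c(25, 50, 100)
)

# Hypothetical helper: share of cells with a defined call that are aneuploid.
aneuploid_fraction <- function(pred) {
  defined <- pred$copykat.pred != "not.defined"
  if (!any(defined)) {
    return(NA_real_)
  }
  mean(pred$copykat.pred[defined] == "aneuploid")
}

# Toy prediction table standing in for copykat_result$prediction.
toy_pred <- data.frame(
  cell.names = paste0("cell", 1:5),
  copykat.pred = c("aneuploid", "diploid", "diploid", "not.defined", "diploid")
)

print(nrow(param_grid))
print(aneuploid_fraction(toy_pred))
```

Each row of param_grid would correspond to one copykat() run on the subset. For a sample known to be normal, the aneuploid fraction should move toward zero as the settings improve.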

These tweaks should help reduce those ~50% false positives without sacrificing much sensitivity for true tumor cells. If you share more (e.g., cell counts, dataset type, or a heatmap snippet), I can make more specific suggestions.

Kevin